7.2 Analysis - eriksvensson
Transcription
Examensarbete LITH-ITN-MT-EX--04/044--SE

Semi-automatic Music Creation using the Continuous Wavelet Transform and Markov Chains

Thomas Björkvald & Erik Svensson
2004-05-28

Department of Science and Technology (Institutionen för teknik och naturvetenskap)
Linköpings Universitet, SE-601 74 Norrköping, Sweden

Thesis work carried out in Media Technology at Linköping Institute of Technology, Campus Norrköping.
Supervisor (Handledare): Niklas Rönnberg
Examiner (Examinator): Björn Kruse
Norrköping, 2004-05-28

ISRN: LITH-ITN-MT-EX--04/044--SE
URL for electronic version: http://www.ep.liu.se/exjobb/itn/2004/mt/044/

Abstract

A common opinion these days is that every song on the radio sounds exactly the same. It is almost as if they were algorithmically made from standardised templates. But could an automated artificial composer really compete with the human alternative? This thesis seeks the answer to this question, and describes the implementation of a semi-automatic composer application. The approach is to imitate the composing style of humans through analysis of the characteristics of existing sound material and synthesis using the statistics obtained in the analysis.
Important aspects of the work include deciding which information is possible and realistic to extract from sound files, and a reliable statistical model allowing the synthesis to produce unique results using only the properties of the input(s). Classic Fourier analysis presents a straightforward way of determining important characteristics of signals. More recent techniques, such as the wavelet transform, offer new possibilities in the analysis, taking the complex research field of sound and music to another level. Markov chains provide a powerful tool for identifying structural similarities of patterns, which can be used to synthesise new material.

Sammanfattning

A common opinion is that all songs played on the radio sound exactly the same. It is just as if they were made algorithmically from ready-made templates. But could an automated artificial composer seriously measure up to the human alternative? This thesis aims to answer this question, and includes the implementation of a semi-automatic composition application. The approach is to imitate human composing by analysing existing sound material and using its properties for synthesis. Important questions include which information is possible and realistic to extract from sound files, and how a reliable statistical model can be created that allows the synthesis to produce unique results based solely on the input data. Classic Fourier analysis provides ways of finding important characteristics in signals. More recent techniques, such as the wavelet transform, offer new possibilities for analysis and take the complex research field of sound and music to new levels. Markov chains are a powerful tool for identifying structural similarities in patterns, which can be exploited in synthesis.

Keywords: Music analysis, music synthesis, digital signal processing, continuous wavelet transform, Fourier analysis, Markov chains, human behaviour.
Table of contents

1 Introduction
  1.1 Background
  1.2 Thesis objectives
  1.3 Nature of the thesis
  1.4 Problems
  1.5 Thesis outline
2 Previous work
  2.1 Introduction
  2.2 Analysis
  2.3 Synthesis
  2.4 Conclusion
3 Method
  3.1 Introduction
  3.2 Analysis
  3.3 Synthesis
  3.4 Conclusion
4 Brief music theory
  4.1 Introduction
  4.2 Tones and octaves
    4.2.1 Tone frequency calculation
  4.3 Frequency ranges
  4.4 Harmonics
  4.5 Chords and scales
    4.5.1 Scales
    4.5.2 Some simple rules
    4.5.3 Chord table
  4.6 Conclusion
5 Analysis
  5.1 Introduction
  5.2 Digital Signal Processing (DSP)
    5.2.1 Time and frequency domain
    5.2.2 Sampling issues
    5.2.3 Aliasing
    5.2.4 Filtering
    5.2.5 Convolution
    5.2.6 Filtering in the frequency domain
    5.2.7 Compressor
  5.3 Frequency analysis
    5.3.1 Fourier Transform (FT)
    5.3.2 Discrete Fourier Transform (DFT)
    5.3.3 Fast Fourier Transform (FFT)
    5.3.4 Time-frequency problem
    5.3.5 Short Time Fourier Transform (STFT)
    5.3.6 Resolution problem
  5.4 Wavelets
    5.4.1 Wavelet history
  5.5 The wavelet theory
    5.5.1 Continuous Wavelet Transform (CWT)
    5.5.2 CWT in the frequency domain
    5.5.3 Visualisation
    5.5.4 Discretisation of the CWT
    5.5.5 More sparse discretisation of the CWT
    5.5.6 Sub-band coding
    5.5.7 Discrete Wavelet Transform (DWT)
    5.5.8 Wavelet families
    5.5.9 Conditions for wavelets
    5.5.10 Wavelets and music
    5.5.11 The Morlet wavelet
  5.6 Conclusion
6 Synthesis
  6.1 Introduction
  6.2 Markov chains
    6.2.1 Statistical model
    6.2.2 Markov chain example
  6.3 Artificial Intelligence (AI)
    6.3.1 The AI research field
    6.3.2 Simulating human behaviour
    6.3.3 AI and music
  6.4 MIDI
    6.4.1 MIDI history
    6.4.2 The MIDI commands
    6.4.3 Standard MIDI file format
    6.4.4 MIDI file example
  6.5 Conclusion
7 Implementation
  7.1 Introduction
  7.2 Analysis
    7.2.1 CWT analysis
    7.2.2 Improving performance
    7.2.3 Fourier spectra
    7.2.4 Normalisation and compressor usage
    7.2.5 Downsampling
    7.2.6 Octave-wise analysis
    7.2.7 Binary threshold
    7.2.8 Holefilling
    7.2.9 Event matrix
    7.2.10 Storing the results
  7.3 Synthesis
    7.3.1 Markov model
    7.3.2 Prefix length
    7.3.3 Tone sequence analysis example
    7.3.4 Creation of new tone sequences
    7.3.5 Controlling the characteristics of the output
  7.4 MIDI representation
    7.4.1 Writing the MIDI format
  7.5 Application process flow
  7.6 Conclusion
8 Results
  8.1 Analysis
  8.2 Synthesis
  8.3 Application screenshots
9 Conclusions
  9.1 Problem formulations
  9.2 Limitations
    9.2.1 No storing of simultaneous tones
    9.2.2 No instrument identification or separation
    9.2.3 MIDI for playback
    9.2.4 No beat detection
    9.2.5 Combining different inputs
  9.3 Thesis separation
  9.4 Artificial intelligence aspects
  9.5 Music theory aspects
  9.6 Final comments
10 Future work
  10.1 Improving performance
  10.2 Improving the features of the analysis
  10.3 Extending the statistical model
  10.4 Enhancing the realism of the synthesis
11 Closing thoughts
12 Bibliography
  12.1 Literature
  12.2 Web
1 Introduction

This thesis has been planned and performed by two students currently finalising their M.Sc. in Media Technology at Linköping University, Sweden. The education combines classic engineering courses with more modern areas, such as computer graphics and digital video. While neither of us has any major musical knowledge (besides strumming out the odd guitar chord now and then), the one thing we do have in common is our obsessive record collecting and interest in all things related to the music scene, be it concerts, gossip or reviews.

1.1 Background

In the world of media technology, the emphasis is mostly put on computer graphics in one way or another. Therefore, a deeper insight into the research area of sound and music is desirable. Some of the methods used in computer graphics, for example the texture synthesis theories in the image-based rendering field, would be interesting to apply in sound-related areas. The concept of Markov chains has been used to statistically generate new note sequences from well-known symphonies. A natural development of these principles is to combine them with signal processing and analysis theories, and thereby be able to use recorded music as the source for the Markov chain (or any other suitable statistical method).

1.2 Thesis objectives

The purpose of this thesis is to generate new music from existing sound material by using frequency analysis and the statistical properties of the analysed information. A simple database containing characteristics from the (several) input sources has to be built. The desired characteristics are, for example, beat and tonal sequence. A decision engine will then produce new music using statistical methods. What this means is that the output will in fact be based on the characteristics of all the different inputs, but will still be something completely unique.

This work could be reconnected with the area of computer graphics by synthesising "mood" music for virtual environments without having to hire a composer or buy expensive copyrighted material. However, the most obvious usage of these ideas would be to create a program where a great number of "super-hits" are used as input and a new, completely original multi-platinum hit single is the output.

Another aspect of this thesis is the challenge of creating an artificial composer without a genuine knowledge of music theory. Can the result be enjoyable music, or will the engineering approach prove itself inadequate?

1.3 Nature of the thesis

While the dream result is a software application good enough to compete with the most acclaimed songwriters of today, consideration has to be given to the extreme complexity of sound and music. Even the simplest piano tune holds a vast wealth of physical properties. Due to this, and the lack of well-known projects of a similar kind, the thesis has to be viewed as an experiment with no given end product. It is not the final result that is important, but rather the insight into an interesting area of science gained during the process. After all, this is the true essence of the engineering spirit. That being said, the plan is of course to fulfil the thesis objectives, even if it means simplifying them.

1.4 Problems

There are three important aspects that will be considered in this thesis:

What sort of information is possible to extract from an arbitrary piece of recorded music?
The term "information" here includes musical characteristics like tone, tempo etc. A natural first step is to try to analyse a simple tonal sequence, with just one instrument playing one tone at a time. The problem here is to isolate the fundamental frequencies and identify the tones, in other words the melody. Appropriate methods can then be used on a larger scale with more instruments playing simultaneously. This results in more complex soundscapes, where it will be necessary to find a way of distinguishing and getting rid of all redundant information (for example the vocals).

How must the extracted information be represented to be storable?

A suitable representation form needs to be designed in order to properly store the information. Here it is absolutely necessary to decide in detail what is to be stored. Is it just the different tones or sequences of tones? How should chords be treated? What about the characteristics that make a certain instrument sound special (i.e. harmonics etc.)? Consideration also has to be given to the aspects of time signature and tempo: should they be stored at all, or should they be decided manually when inputs are combined? An issue here could be, for example, that two input melodies have different time signatures and/or tempi. Should they be combined at all in this case?

How can the stored information be used to synthesise new material?

The information must be analysed using statistical methods, and combined with a decision engine to create new patterns of music. The idea implies that the more input there is, the more original the results will be, since the decision engine will have more information to work with.

1.5 Thesis outline

This thesis consists of two major threads, sound analysis and sound synthesis, which are clearly separated in each chapter. The sound analysis parts consider the aspects of extracting information from sounds produced by musical instruments. The sound synthesis parts then deal with using this extracted information to produce new music. To separate the theories and methods used from how the practical problems were actually solved, all theories used are explained thoroughly in chapters 4 through 6. The implementation chapter 7 describes how the theories were used in practice. The result and conclusion chapters 8 and 9 sum up the knowledge obtained along the way, and address the original thesis objectives and problems declared in sections 1.2 and 1.4, respectively.

2 Previous work

2.1 Introduction

In order to properly understand the thesis problems and what to make of them, a thorough study of previous work in the fields of sound analysis and synthesis has to be done. Of interest are not only books and articles, but also websites and software.

2.2 Analysis

Commercial "wav2midi" software has been available for some time. The purpose of this type of application is to analyse sampled music in wave format and then convert it into MIDI. An example is Recognisoft's Solo Explorer. This application, along with several other similar programs, has proved quite successful in completing its task. However, as the name "Solo Explorer" implies, the program only works for a melody line played by a single instrument. [30]

Jehan, a Ph.D. candidate at the MIT Media Laboratory in Cambridge, Massachusetts, has written a thesis about the analysis and synthesis of musical signals. The emphasis here is put on segmentation and frequency estimation.
The term segmentation refers to the analysis of so-called musical events, a property that can describe a number of features of the musical tone. For instance, it can appear as a vowel change in a sung melody, or the percussive sound of a drum. By analysing the musical events, a lot can be said about the music. Jehan explains two different methods for the segmentation part: a Fourier-based frequency analysis, which involves normalisation of the energy of the signal, and a statistically based method. The frequency estimation part of Jehan's thesis involves filtering the signal with wavelet-based filters, and then investigating the zero-crossings in order to calculate the frequency. [3]

Chapman, who wrote his Ph.D. thesis at the Meteorology Department at the University of Reading, has created a small Matlab-based program called Multiana. This program was originally created for determination of the melodies of a guitar and a harmonica in a short piece of music, using wavelets. It plots the result in a graph containing five octaves of notes and the resulting coefficients of the wavelet analysis. This way, it graphically visualises the melodic content of the music signal. [12]

In his B.Sc. thesis for the University of Sheffield, Self describes the wavelet transform in depth, and how it can be used for music analysis purposes. He also explains how a visualisation of the transform result can be created, in what are usually called scalograms. [11]

In their article for SIAM Review, Alm et al. describe several different approaches for analysing the content of sounds produced by musical instruments. The most promising technique provided is the use of the Continuous Wavelet Transform (CWT) for calculating the scalogram of a sound. Alm et al. also introduce the use of complex wavelets, which because of their oscillatory nature have a shape similar to sound waves from common harmonic instruments. [1]

Matlab's Wavelet Toolbox has a built-in function for performing the CWT analysis, described in the documentation [21]. From a sound and particularly musical analysis point of view, it is the possibility to use complex Morlet wavelets directly on a signal with just a couple of lines of code that is the most appealing feature of the Wavelet Toolbox [20].

2.3 Synthesis

On his webpage, Kesteloot presents a very interesting example of the possibilities of using Markov chains to synthesise written language. The application takes two texts in different languages as input, and builds one Markov chain for each of the texts. By using the probabilities of the analysed inputs, and interpolating the generated output from one of the Markov chains to the other, the result is a completely unique text, switching from looking like the first language to looking like the second one. The application also involves choosing a so-called prefix length, which decides how many letters are grouped together in the input analysis. A high value results in realistic outputs, while small prefix numbers generate "mumbo-jumbo" sentences. [16]
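The prefix-based text generation just described can be illustrated with a small sketch. The code below is not Kesteloot's application but a minimal Python approximation of the idea: prefixes of a chosen length are mapped to the characters observed after them, and the chain is then walked to produce new text. All names and parameter values are illustrative assumptions.

```python
# Minimal sketch of prefix-based Markov text generation (illustrative only):
# every prefix of a chosen length is mapped to the characters that follow it
# in the input, and the chain is walked to produce new text. Larger prefix
# lengths give more realistic output; very short ones give "mumbo-jumbo".
import random
from collections import defaultdict

def build_chain(text: str, prefix_len: int = 3) -> dict:
    chain = defaultdict(list)
    for i in range(len(text) - prefix_len):
        prefix = text[i:i + prefix_len]
        chain[prefix].append(text[i + prefix_len])   # observed successor
    return chain

def generate(chain: dict, length: int = 80) -> str:
    prefix = random.choice(list(chain))
    out = prefix
    for _ in range(length):
        successors = chain.get(out[-len(prefix):])
        if not successors:
            break
        out += random.choice(successors)             # weighted by observed frequency
    return out

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog and the quick cat"
    print(generate(build_chain(sample, prefix_len=3)))
```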
Another interesting application of Markov chains is the SynText plugin for Microsoft Word. This plugin creates Markov chains from text documents, and can be used to find reoccurring patterns and repetitions. [25]

Various people have also used Markov chains to compose musical patterns. Stanford alumnus Maurer presents a brief summary of the field of algorithmic composition on his webpage [24].

In a project report for the University of Derby, Mosley describes an application that correlates an ordinary sine wave at different frequencies with the waveform being analysed. It is in fact a simplified version of the idea of wavelets he uses, without ever mentioning the word "wavelet". However, the most interesting part of Mosley's report is the MIDI writing routine he provides. Especially useful are his methods for converting time (in samples) into terms of clicks per quarter note and tempo, and rewriting this time information into binary format. Both these transformations are necessary for the notes to be played correctly by the MIDI player of choice. [6]

2.4 Conclusion

Evidently, research has been done in the areas of music analysis and synthesis. A common denominator, however, is not as obvious. Consequently, effort has to be put into deciding which of the relevant techniques are to be studied more closely.

3 Method

3.1 Introduction

In order to fulfil the thesis objectives, the separate parts each needed a suitable solution to their problems. After all, this is an experimental thesis with no given starting points. Thus, the study and selection of relevant material was a crucial part of the work. This became particularly important when the chosen theories were to be turned into code and had to be used throughout the whole implementation process. Changing theory in some part of the code along the way could in fact result in having to rewrite all of it.

3.2 Analysis

The intention of the analysis part of the thesis was to analyse sampled audio data and determine the notes played in the music file. The wav2midi type of software, made for this exact purpose, was therefore clearly interesting at first. But since all of these programs were commercial products, the source code, or even a brief explanation of the ideas behind them, was more or less impossible to find. Moreover, these applications generally only worked for a melody line played by a single instrument. This was not acceptable, as it seriously limited the thesis objectives. Therefore, the wav2midi principle was quickly abandoned.

Multiana presented the concept of wavelets and a phenomenon referred to as multi-resolution analysis, which became the basis of the analysis part of this thesis. In particular, the drawing function of the program, plotting the content of the music for each and every note ranging over five octaves, was impressive. Sadly, the code for Multiana was far from efficient, but at least it clearly showed the possibilities and power of complex wavelets in the area of music analysis.

With this new way of looking at the analysis problems, the focus was put on studying the theoretical nature of wavelet analysis, how it differed from Fourier analysis, and especially how it could be used to analyse the sampled audio data. The most important feature of this form of analysis was that, unlike traditional Fourier analysis, which simply identifies the frequency content of a signal without localising it in time, it also involves the time aspect. All things considered, wavelets were the obvious choice of method for the sound analysis.

3.3 Synthesis

The use of wavelet analysis was chosen after a thorough study of the research field of music analysis. Markov chains, on the other hand, were more or less decided upon from the very beginning, when this thesis was merely an idea toyed with in the back of the authors' minds.
Ever since hearing about interesting projects involving Markov chains, the opportunity to learn more about this technique had been desired. Experimenting with various Markov-based applications further strengthened this gut feeling.

3.4 Conclusion

Even if engineering techniques such as wavelets and Markov chains could be decided upon, they were not of much use without knowing what to look for in the analysis. These somewhat general methods needed to be used in conjunction with classical music theory.

4 Brief music theory

4.1 Introduction

For a thesis of this nature, a brief insight into the fascinating world of music was essential. Even though the work was performed in an engineering spirit, it still involved simple rules and features of traditional western music. The concept of naming tones and octaves comes down to the fact that all musical sounds are about frequency. Koniaris' Understanding Notes and their Notation webpage [18] influenced much of the discussion in this chapter.

4.2 Tones and octaves

The tonal system is divided into octaves and semitones. Usually, it is claimed that eleven octaves are sufficient to cover the human hearing, where each one of the octaves consists of twelve semitones. The frequency range of an octave is between f and 2f, meaning that the same semitone in the next octave has exactly double the frequency of the current one. The frequency of a tone being played is called the fundamental frequency. In order to name the semitones in a distinguishable way, the octave number is written directly after the corresponding semitone. For instance, an A tone in the fourth octave is denoted "A4".

Originally, western music was defined as consisting of seven distinct notes: C, D, E, F, G, A, and B (or H as it is referred to in some European countries). How does this work with the twelve-semitone octave system mentioned above? The answer is quite simple; over time it was realised that more semitones were needed, in order to enable the transcription of pieces so that they would match the limited tonal resources of singers. To avoid having to rewrite all existing sheet music, the new semitones, located in between the original tones, were denoted with "#" (pronounced "sharp") or "b" (pronounced "flat") [1]. The "#" is used to raise a tone, while the "b" is used to lower it. This means that "A#" and "Bb" are in fact the same semitone.

Semitone:  0  1      2  3      4  5  6      7  8      9  10     11
Notation:  C  C#/Db  D  D#/Eb  E  F  F#/Gb  G  G#/Ab  A  A#/Bb  B/Cb
Table 4.1: Semitone order.

[1] "b" is actually not an entirely correct notation, but rather a "computerised" version of the real sign for flat notes.

4.2.1 Tone frequency calculation

With the octave system it is not difficult to calculate the frequency of any given tone. A formula can be defined as in (4.1).

f = f_0 \cdot 2^{\,note/12}    (4.1)

Here f_0 is the frequency of the C0 note (~16.3516 Hz) and note is the index of the current semitone (the index of f_0 is zero). Using this formula, any given tone can be expressed as a frequency. To exemplify this, consider the A4 tone. With the index system, this semitone has index 57. Formula (4.1) then yields f = 16.3516 \cdot 2^{57/12} ≈ 440 Hz.
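As a quick check of formula (4.1), the small Python sketch below computes the frequency of any semitone from its index and reproduces the A4 ≈ 440 Hz example. The index convention (C0 = 0) and the value of f_0 follow the text above; the helper function names are just illustrative.

```python
# Sketch of equation (4.1): frequency of a semitone from its index.
# F0_C0 is the frequency of C0 (~16.3516 Hz), as given in section 4.2.1.

F0_C0 = 16.3516  # Hz, fundamental frequency of C0

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def tone_frequency(note_index: int) -> float:
    """Frequency of the semitone with the given index (C0 has index 0)."""
    return F0_C0 * 2 ** (note_index / 12)

def note_to_index(name: str, octave: int) -> int:
    """Semitone index of e.g. ('A', 4) -> 57."""
    return octave * 12 + NOTE_NAMES.index(name)

if __name__ == "__main__":
    idx = note_to_index("A", 4)                 # 57
    print(idx, round(tone_frequency(idx), 1))   # 57 440.0
```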
4.3 Frequency ranges

With the human hearing usually divided into eleven octaves, it is interesting to examine in what ranges popular instruments appear. The instrument spreading most over the spectrum is the piano, which can in fact appear all the way from the lowest (zeroth) to the eighth octave. While this offers great freedom when it comes to playing the instrument, it also makes it difficult to analyse, because it appears in regions of the frequency spectrum where other instruments reside. Looking at a standard-tuned guitar, for instance, the lowest possible note is an E2 and the highest one is an E6. All of these notes can also be played on a piano. How is it then possible to identify which of the instruments is actually playing a certain melody? The frequency ranges for some of the most popular instruments are listed in Table 4.2.

Instrument    Lowest tone  Highest tone
Piano         A0           C8
Guitar        E2           E6
Bass guitar   B0           D4
Cello         C2           A6
Violin        G3           E7
Table 4.2: Frequency ranges of popular instruments.

4.4 Harmonics

The property that makes instruments sound different is the phenomenon of harmonics. When a note is played, it is not just the fundamental frequency of the note that is heard, but rather a combination of this frequency together with a unique number of other semitones, the harmonics. Consider an E2 being played on a guitar. Besides the fundamental frequency of the E2 tone, the following harmonics appear:

• +12 semitones (one octave higher, i.e. E3)
• +19 semitones (B3)
• +24 semitones (E4)
• +27.86 semitones (somewhere between G4 and G#4)
• +31 semitones (B4)
• +33.69 semitones (somewhere between C#5 and D5)
• +36 semitones (E5)

(The n:th harmonic has a frequency of n times the fundamental, and therefore lies 12·log2(n) semitones above it, which is why some of the offsets are not whole semitones.)

These notes, often the same semitone as the fundamental tone but in higher octaves, are played simultaneously, regardless of what instrument is being used. What makes the instruments sound different is the fact that the harmonics appear with different strengths depending on the instrument. The harmonics are always weaker than the fundamental, and they always appear in higher octaves.

4.5 Chords and scales

Individual notes are nice on their own, but in "real" music they are often combined into chords. Mugglin likens the musical language to the alphabet. The base is then the so-called scale. Each note is defined as a letter, and by putting notes together from the scale, chords (words) are created. Next, the words are put together to form phrases (musical sentences). Just like with written language, knowing words is not the same as knowing the language. A natural understanding of how the words fit together is essential. [28]

4.5.1 Scales

There are many different scales, or languages of music. The simplest one is called the major scale. It is obtained by playing a certain combination of tones:

{ start, tone, tone, semitone, tone, tone, tone, semitone }

"Tone" means skipping one semitone (and thereby playing the next full note), and "semitone" means just that: playing the next semitone. By choosing a starting note, the scale takes on different shapes. The C major scale has the nice feature of consisting of only the white keys in a keyboard octave, i.e. no sharps or flats (see Table 4.3).

Step:  start  tone  tone  semitone  tone  tone  tone  semitone
Note:  C      D     E     F         G     A     B     C
Table 4.3: C major scale.

Notice how the F tone is the semitone after the E tone, and how the C tone follows the B in the same way. For simplicity, the different notes of the scale are usually numbered, as in Table 4.4.

Note:    C  D  E  F  G  A  B  C
Number:  1  2  3  4  5  6  7  1
Table 4.4: Numbers of the major scale.
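The scale construction in section 4.5.1 can be sketched in a few lines of Python: starting from a chosen key, the interval pattern above is followed around the twelve semitones. This is only an illustration of the rule, not part of the thesis application, and the names used are assumptions.

```python
# Sketch of the major-scale rule from section 4.5.1: starting from a chosen
# key, follow the pattern tone, tone, semitone, tone, tone, tone, semitone.

SEMITONES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]  # "tone" = 2 semitones, "semitone" = 1

def major_scale(key: str) -> list[str]:
    """Return the eight notes of the major scale for the given key."""
    idx = SEMITONES.index(key)
    scale = [key]
    for step in MAJOR_STEPS:
        idx = (idx + step) % 12
        scale.append(SEMITONES[idx])
    return scale

if __name__ == "__main__":
    print(major_scale("C"))  # ['C', 'D', 'E', 'F', 'G', 'A', 'B', 'C'] (Table 4.3)
    print(major_scale("D"))  # ['D', 'E', 'F#', 'G', 'A', 'B', 'C#', 'D']
```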
From the major scale, one chord made up of several notes can be created for each note. The chord based on note 1 (in the C major scale the C note) is called the "one chord". Assigning Roman numerals to the different chords, a relationship table can be written (see Table 4.5).

Note:   1  2   3    4   5  6
Chord:  I  ii  iii  IV  V  vi
Table 4.5: Chords derived from the major scale.

The chord based on note 7 is a bit tricky, and is left out of this brief chord theory. Some of the chords are written with lowercase numerals, while others are uppercase. The lowercase chords are called minor chords, and their sound is often described as darker and sadder compared to the uppercase ones. In the same way, chord structures can be built for a variety of different scales. Some examples of other scales are natural minor, harmonic minor and melodic minor.

4.5.2 Some simple rules

Mugglin states some simple rules for creating an enjoyable song:

• Starting and ending the song on the same note or chord establishes the song's beginning and end, giving the melody a more finished sound.
• Do not use too many chords. Many songs have been written using only three chords: I, IV, and V.
• Choose a key, which means deciding the note 1. For instance, if a D note is selected as 1, the song is played in the key of D, and so forth. The selection of note 1 changes the content of the major scale.

4.5.3 Chord table

All chords I to vi for the different notes of a major scale, in any key, can be summarised as in Table 4.6.

Key   I    ii    iii   IV   V    vi
C     C    Dm    Em    F    G    Am
C#    C#   D#m   Fm    F#   G#   A#m
D     D    Em    F#m   G    A    Bm
D#    D#   Fm    Gm    G#   A#   Cm
E     E    F#m   G#m   A    B    C#m
F     F    Gm    Am    A#   C    Dm
F#    F#   G#m   A#m   B    C#   D#m
G     G    Am    Bm    C    D    Em
G#    G#   A#m   Cm    C#   D#   Fm
A     A    Bm    C#m   D    E    F#m
A#    A#   Cm    Dm    D#   F    Gm
B     B    C#m   D#m   E    F#   G#m
Table 4.6: Full chord table.

So, for each key there are six (in reality seven) different chords. How does one know which ones fit and sound good together? Generally, the seven chords of a key sound quite good together, since they originate from the same scale. However, Mugglin proposes a simple map, showing how chord sequences can be chosen to sound natural. When listening to a song, one is constantly trying to guess which chord comes next. This fact can be used whilst composing; the listener wants to be able to guess the next chord, but at the same time he or she sometimes wants to be surprised. By using the map, and occasionally deliberately "fooling" the listener by playing a chord that is not the natural choice of successor, the music created can be varied enough to be exciting, but at the same time regular enough to be comforting.

4.6 Conclusion

The tone and octave system of western music is built entirely on the frequency characteristics of tones. By naming twelve semitones in each octave, the human range of hearing can be represented. An important phenomenon is the concept of harmonics, making different instruments sound different. Scales can be seen as the musical language, and by defining different scales, chords and different ways of combining them can be derived. Without this knowledge, it is more or less impossible to direct the analysis towards the desired results.

5 Analysis

5.1 Introduction

The purpose of the analysis part of the thesis is to examine sampled signals, in the form of music. Thus, this chapter introduces more or less well-known theories relating to sound and signals: Digital Signal Processing (DSP), Fourier analysis and the wavelet theory.
Much of the signal theory is based on Fundamentals of Signals and Systems Using MATLAB by Kamen et al. [4].

5.2 Digital Signal Processing (DSP)

The concept of signals and systems is a very general field of study, and its applications can be found virtually everywhere, from home appliances to advanced engineering innovations. Sound is an area well suited to the signal theory; speech as well as music can be described as continuous signals. For every time instant the signal has a corresponding amplitude value.

Figure 5.1: 440 Hz signal; the A4 tone.

Continuous signals are sometimes referred to as analogue. However, in order to be able to work with analogue signals in a computer environment, the signals need to be sampled, yielding a discrete version of the signal. Sampling an analogue signal simply means reading the signal value at regular time intervals, thereby discretising the time variable. Thus, the signal is no longer based on infinitely many time values. The sampling rate states how many times per second the signal amplitude is read, and is also called the sampling frequency.

Figure 5.2: Sampled signal.

5.2.1 Time and frequency domain

Traditionally, a signal can be described in two different domains: the time domain and the frequency domain. The information given in the respective domains is the same, but presented in different ways; the time domain consists of signal amplitude values for each time instant, while the frequency domain contains the magnitudes of all frequencies without time information.

Figure 5.3: Time (left) and frequency domain (right) representations of a 440 Hz sine signal.

5.2.2 Sampling issues

The sampling process has one important limitation. If the signal has a highest frequency of f, it needs to be sampled with a minimum frequency of 2f. This is called the sampling theorem, and the minimum sampling frequency is often referred to as the Nyquist sampling frequency. If the sampling theorem requirement is not met, i.e. if the sampling frequency is too low, the signal cannot be correctly reconstructed from the sampled version. This is a phenomenon referred to as aliasing.

Figure 5.4: 1440 Hz signal sampled at 1000 Hz, which is below the Nyquist frequency of 2880 Hz. This results in the reconstructed signal having a 440 Hz frequency.

5.2.3 Aliasing

Figure 5.3 is not entirely truthful, because it only shows the first half of the frequency spectrum. When a signal is transformed from the time domain to its frequency equivalent, half of the resulting spectrum is redundant information. All information above half the sampling frequency is a reflection of the content below it. When sampling with a too low sampling frequency, unwanted frequency components occur among the real frequencies. This is due to the reflection being "pushed" into the first half of the spectrum.

Figure 5.5: FFT spectrum without aliasing. The right peak is a reflection of the left, since the content is mirrored at half the sampling rate, i.e. 500 Hz. The sampling frequency is 1000 Hz, which is above the Nyquist frequency for this 440 Hz signal.

Figure 5.6: The frequency spectrum of the same signal as in Figure 5.5 overlapping and causing aliasing, due to the sampling frequency being 700 Hz, below the Nyquist frequency. A "ghost" frequency appears at ~265 Hz.
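The aliasing effect is easy to reproduce numerically. The sketch below, shown in Python with numpy for illustration only, uses the numbers from Figure 5.4: a 1440 Hz sine sampled at 1000 Hz produces a spectrum whose peak appears at the alias frequency of 440 Hz.

```python
# A small numpy sketch of the aliasing effect described in section 5.2.3,
# using the numbers from Figure 5.4: a 1440 Hz sine sampled at 1000 Hz
# (below the Nyquist rate of 2880 Hz) shows up as a 440 Hz component.
import numpy as np

fs = 1000          # sampling frequency in Hz (too low for a 1440 Hz signal)
f_signal = 1440    # true signal frequency in Hz
t = np.arange(0, 1.0, 1 / fs)          # one second of samples
x = np.sin(2 * np.pi * f_signal * t)   # the sampled signal

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
print(freqs[np.argmax(spectrum)])      # ~440.0 Hz, the "ghost" alias frequency
```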
A classic example of aliasing is the "wagon wheel effect". Consider filming a spoked wheel spinning. If the camera registers the motion too slowly, i.e. with too low a sampling rate, the reconstructed film sequence will actually show a wheel that appears to be spinning backwards. There are a couple of factors to consider in order to avoid aliasing:

• The sampling frequency needs to be equal to or higher than the Nyquist frequency.
• By lowpass filtering the signal, removing the frequency content higher than that of interest before the sampling, the aliasing effect can be decreased.

5.2.4 Filtering

Transforming an input signal x(t) into an output signal y(t) is called filtering. The purpose of this process can be, for instance, to remove certain unwanted frequencies from a signal, keeping only a band of interest; this is called bandpass filtering. By removing all frequencies above a certain threshold, the signal is lowpass filtered. The opposite treatment, removing all frequencies below the threshold, is called highpass filtering. Filtering can be performed in two different ways: in the time domain by convolution, and in the frequency domain by a simple pointwise multiplication.

5.2.5 Convolution

The behaviour of a linear, continuous, time-invariant system with input signal x(t) and output signal y(t) is described by the convolution integral (5.1).

y(t) = \int_{-\infty}^{\infty} h(\tau)\, x(t - \tau)\, d\tau, or in short, y = h \otimes x    (5.1)

Here, the signal h(t) is the filter. When working with discrete-time systems, as with sampled signals on computers, the convolution integral becomes a convolution sum (5.2).

y[n] = \sum_{k=-\infty}^{\infty} h[k]\, x[n - k]    (5.2)

What this means is that the convolution sum and integral express the amount of overlap of one function x as it is shifted over another function h. Hence, convolution could be described as "blending" one function with another. At times when the two functions are much alike, the value of y is large, and naturally it is small when the functions match poorly. [23]

Figure 5.7: Convolution of the signal x with the filter h. The green curve shows the convolution of the red and blue functions, with the vertical green line indicating the position in time. The grey region is the product h(τ)x(t − τ), its area being precisely the convolution at the given time instant.
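A minimal illustration of the convolution sum (5.2) used as a filtering operation: a noisy tone is smoothed with a short moving-average (lowpass) filter. The sampling frequency and filter length below are arbitrary assumptions chosen for the example.

```python
# A minimal numpy sketch of the discrete convolution sum (5.2): smoothing a
# noisy 440 Hz tone with a short moving-average (lowpass) filter h[n].
import numpy as np

fs = 8000                                   # assumed sampling frequency in Hz
t = np.arange(0, 0.05, 1 / fs)
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(t.size)  # noisy tone

h = np.ones(8) / 8                          # simple 8-point averaging filter
y = np.convolve(x, h, mode="same")          # y[n] = sum_k h[k] x[n-k]

print(x.shape, y.shape)                     # filtered signal has the same length
```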
5.2.6 Filtering in the frequency domain

Although it is convenient to use convolution for filtering, it also introduces an annoying problem when working with large signals: the computational burden increases rapidly with the lengths of the signal and the filter. For recorded music, several minutes in length with a high sample rate, the situation becomes unbearable. The number of calculations needed for convolution is N_x · N_h multiplications and (N_x − 1)(N_h − 1) additions, where N_x is the number of samples in the signal and N_h is the number of filter points. This can be compared with filtering in the frequency domain, where (not counting the cost of transforming the signal) the number of calculations needed equals the length of the signal (N_x).

It might sound like filtering in the frequency domain is the magical answer to the signal processor's prayers. And this may very well be true for larger signals like sound files, but for shorter ones, good old-fashioned convolution can actually be faster.

An important difference between filtering in the two domains is the size of the filter. In the time domain, the filter size is set large enough to get good enough results (which of course depends on the application). In the frequency domain, the filter size remains the same for every filter operation on a certain signal. The filter size is simply the length of the signal, as the filtering consists of a pointwise multiplication between the filter and the signal (5.3).

\hat{y}(\omega) = \hat{h}(\omega)\, \hat{x}(\omega)    (5.3)

The signals \hat{y}, \hat{x} and the filter \hat{h} have been transformed into the frequency domain and are thereby functions of frequency instead of time. If the filter is shorter than the signal (as is usually the case), it is padded with zeros during the transformation to achieve equal lengths. A more efficient procedure would be defining the filter directly in the frequency domain, saving the computational cost of transforming the filter. For filters changing the frequency characteristics of a signal, this can be quite intuitive, since working in the frequency domain allows for direct manipulation of the frequency content. But there is no easy way to define filters that change the behaviour of a signal over time in terms of frequency operations. In this case, the transformation from the time domain to the frequency domain is basically necessary.

Bandpass filtering is an example of a straightforward operation in the frequency domain, since this kind of filtering is nothing more than removing all unwanted frequencies. Here, the filter is just a binary vector, where ones mean "pass" and zeros mean "stop". This might introduce some undesirable effects when the filtered signal is transformed back to the time domain, so a smoother filter (e.g. a Gaussian) is usually preferred.

5.2.7 Compressor

A compressor is used to reduce (compress) the dynamic range (i.e. the variation in amplitude) of an input signal. By setting a threshold level, the compressor will attempt to maintain that level by turning down everything above it by a certain ratio. As the ratio approaches infinity, the compressor turns into a limiter. All points of the signal with an amplitude below the threshold are unaffected, and all other points will have their amplitude compressed. [2]

The effect of the compressor is that weak amplitudes are augmented relative to the strong ones, which are attenuated, leaving the signal with more even sound levels. This procedure is often used to dampen peaks that are too high when recording music, preventing important but less accentuated sounds from being drowned in the mix. The compressor can also be a useful tool in signal processing, since it can make a noticeable difference for frequency analysis.

Figure 5.8: The effect of a compressor. Using a 2:1 ratio, all signal content above the threshold is compressed to half the overflow. An n:1 ratio (n being large) gives the effect of a limiter, since all content above the threshold is set to the threshold value.
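The compressor behaviour described above can be sketched as a simple per-sample operation. The threshold and ratio values in the Python sketch below are illustrative assumptions; the 2:1 case mirrors Figure 5.8.

```python
# A minimal sketch of the compressor in section 5.2.7: samples whose magnitude
# exceeds the threshold are turned down by the given ratio; a very large ratio
# behaves like a limiter (cf. Figure 5.8). Parameter values are illustrative.
import numpy as np

def compress(x: np.ndarray, threshold: float = 0.5, ratio: float = 2.0) -> np.ndarray:
    """Reduce the part of |x| that overshoots the threshold by 1/ratio."""
    overshoot = np.maximum(np.abs(x) - threshold, 0.0)
    return np.sign(x) * (np.minimum(np.abs(x), threshold) + overshoot / ratio)

if __name__ == "__main__":
    x = np.array([0.2, 0.6, 1.0, -0.9])
    print(compress(x, threshold=0.5, ratio=2.0))   # [ 0.2   0.55  0.75 -0.7 ]
```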
5.3 Frequency analysis

Analysis of sound and music is not in any way a new field of research. However, this is not the same as saying that it is well known. In order to decide exactly how to analyse signals and which factors need to be considered, a thorough study of the available techniques, their differences and similarities, is absolutely necessary. For most engineers, the Fourier Transform is the natural tool for analysing signal content. But is it the most appropriate method for this thesis? Are there any other, more suitable theories available?

5.3.1 Fourier Transform (FT)

Few scientists have made such an impression on the scientific world as Joseph Fourier. His discoveries have influenced applications in areas such as science, mathematics, engineering and, perhaps the most significant one, signal processing. The revolution started in the early 19th century with this simple claim from Fourier:

"Any periodic function can be represented as a sum of sines and cosines."

The Fourier theories have since been recognised not only as a mathematical tool, but also as a physical phenomenon. Sound and music analysis and synthesis are areas in which the Fourier research has become a cornerstone. The transformation between the time and frequency domains using the FT is totally lossless, enabling transformations between the two domains without losing information. [11]

Figure 5.9: Transformation between the different domains.

This feature is useful in many applications, for instance in image processing, where an image can be transformed into its frequency domain using the FT. Many of the filters used in image processing are frequency-based, and the filtering is performed in the frequency domain.

Any signal which periodically repeats itself can be described by a sum of well-defined sinusoidal (sine and cosine) functions. This can be expressed as the so-called Fourier series (5.4).

f(t) = \frac{1}{2} a_0 + \sum_{k=1}^{\infty} \left( a_k \cos kt + b_k \sin kt \right)   (5.4)

The Fourier coefficients a0, ak and bk can be defined as in (5.5).

a_0 = \frac{1}{\pi} \int_0^{2\pi} f(t)\, dt, \quad a_k = \frac{1}{\pi} \int_0^{2\pi} f(t) \cos(kt)\, dt, \quad b_k = \frac{1}{\pi} \int_0^{2\pi} f(t) \sin(kt)\, dt   (5.5)

By multiplying sinusoidal functions with different amplitudes and adding them together, any periodic signal can be described. Thus, simple sine and cosine signals form the basis of the approximated signal. An example of this procedure is given by Figure 5.10.

Figure 5.10: Signal constructed by adding sinusoidal waves with frequencies 440, 880 and 1320 Hz (the last one with all amplitudes shifted by 4).

Only multiples of the fundamental frequency of the signal need to be accounted for in the periodic case. Furthermore, in practice merely a finite number of harmonics (multiples) is needed to approximate f(t) with an acceptable error. Thus, the Fourier series is a sparse way of representing periodic signals. However, only being able to handle periodic signals is a big limitation. In reality, entirely periodic signals are not that common, and it is often necessary to approximate non-periodic ones. To do this, the Fourier series needs to incorporate all possible frequencies. The Fourier coefficients can then be described as one complex coefficient (5.6).

c_k = \frac{1}{2\pi} \int_0^{2\pi} f(t)\, e^{-jkt}\, dt   (5.6)

This gives the complex Fourier series (5.7).

f(t) = \sum_{k=-\infty}^{\infty} c_k\, e^{jkt}   (5.7)

Turning the sum into an integral gives the FT for any non-periodic signal (5.8).

\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt   (5.8)

The inverse FT can be defined as in (5.9).

f(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \hat{f}(\omega)\, e^{j\omega t}\, d\omega   (5.9)

There are a number of different ways of using the FT, each having its own characteristics.

5.3.2 Discrete Fourier Transform (DFT)

Since computers are discrete machines unable to calculate continuous functions with infinitely many values, they need a discrete implementation of the FT, the Discrete Fourier Transform. This is achieved by modifying the transform formula into a sum rather than an integral, and by restricting the frequency interval being used. The signal is then sampled at a regular grid of N points (5.10).

F[k] = \sum_{n=0}^{N-1} f[n]\, e^{-j 2\pi k n / N}, \quad k = 0, 1, \ldots, N-1   (5.10)

Here, 2πk/N and n correspond to the continuous variables ω and t, respectively.
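As a concrete illustration (an assumed example, not code from the thesis), the DFT sum (5.10) can be written out directly in Matlab and compared with the built-in fft function, which computes the same transform:

% Minimal sketch: the DFT sum (5.10) written out explicitly.
f = sin(2*pi*5*(0:63)/64);         % assumed test signal, N = 64 samples
N = length(f);
F = zeros(1, N);
for k = 0:N-1
    n = 0:N-1;
    F(k+1) = sum(f .* exp(-1j*2*pi*k*n/N));   % equation (5.10)
end
max(abs(F - fft(f)))               % agrees with fft up to round-off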
5.3.3 Fast Fourier Transform (FFT)

The Fast Fourier Transform is an interesting and useful development of the DFT. Basically, it exploits the symmetries in the DFT so that the same transform can be computed with far fewer arithmetic operations (on the order of N log N instead of N²), and is therefore much faster. The FFT is implemented in many software applications, including Matlab.

5.3.4 Time-frequency problem

The FT, DFT and FFT have one problem in common, and unfortunately it is a rather large issue. All these variants of the Fourier transform can accurately distinguish the frequency contents of an arbitrary signal. However, what if the time location of the frequencies is of interest? None of the methods give any time information whatsoever. The result of this limitation is that the methods are only of interest when dealing with a stationary signal, built on constant frequencies, where the time aspect is not of interest.

5.3.5 Short Time Fourier Transform (STFT)

In order to introduce a time dependency in the FT, thereby making it able not only to estimate the frequencies but also when they occur, the Short Time Fourier Transform (or Windowed Fourier Analysis) was developed. The principle of the STFT is to introduce a time parameter, so that the transformation is performed over a limited part of the signal, somewhat like looking at the signal through a window. By choosing a window size that makes the signal inside it practically stationary, the frequency can be estimated locally, and consequently over the entire signal.

Figure 5.11: The STFT principle. By moving the window over the signal, the local frequency inside the window can be approximated over the entire signal.

The signal viewed in the window is represented by (5.11).

x(t)\, w(t - T)   (5.11)

T is the time location of the centre point of the window and w(t) is the window function. Applying this to equation (5.8) yields the STFT formula (5.12).

S(\omega, T) = \int_{-\infty}^{\infty} x(t)\, w(t - T)\, e^{-j\omega t}\, dt   (5.12)

This is in principle a convolution between the signal and the window, and the process can therefore be seen as filtering.

5.3.6 Resolution problem

With the STFT being very dependent on the choice of window form and size, its major disadvantage is that a fixed window size cannot suit all contents of the signal. A small window is good when approximating high frequency parts, but when trying to estimate a low frequency, the window is far too small to detect the oscillations. In the same way, a large window can be used to detect low frequency oscillations, but has no chance of detecting rapid changes (i.e. high frequencies). Because of the fixed window size, the resolution of the STFT is poor.

Figure 5.12: The fixed window size of the STFT is unable to approximate different frequencies in the same signal. The red window at the time instant T covers one full period, but was unable to do so earlier in the signal (illustrated by the dashed grey window).
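The STFT principle can be sketched in a few lines of Matlab (an assumed illustration, not code from the thesis): the signal is cut into consecutive windowed segments and a DFT is computed for each segment, cf. (5.12). The test signal, the window choice and the hop size are all assumptions.

% Minimal sketch: STFT as a DFT of consecutive windowed segments.
fs   = 8000;
x    = [sin(2*pi*440*(0:fs-1)/fs), sin(2*pi*880*(0:fs-1)/fs)];  % 440 Hz, then 880 Hz
wlen = 256;
w    = 0.5 - 0.5*cos(2*pi*(0:wlen-1)/(wlen-1));   % window function w(t)

nFrames = floor(length(x)/wlen);
S = zeros(wlen, nFrames);
for m = 1:nFrames
    seg     = x((m-1)*wlen + (1:wlen)) .* w;      % x(t)w(t - T) for this T
    S(:, m) = fft(seg).';                          % local spectrum at time T
end
% Each column of S is the spectrum of the signal seen through one window
% position, giving both frequency and (coarse) time information.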
5.4 Wavelets

The wavelet theory and principles were originally developed to address the shortcomings of the FT, and especially the STFT. Where classic Fourier analysis offers the possibility to reveal the frequencies of a stationary signal, wavelet analysis takes the idea one step further; not only does it determine which frequencies exist in the signal, but also when they occur. The improvement in the resolution of the frequency and time detection makes the wavelet transform the natural successor of the STFT.

5.4.1 Wavelet history

Tracing exactly where, when and by whom the wavelet analysis theory was created is a difficult task. Yves Meyer, who was one of the first people to obtain public attention for his wavelet work, once made the following statement, which in short sums up the wavelet situation:

"Tracing the history of wavelets is almost a job for an archeologist, I have found at least 15 distinct roots of the theory, some going back to the 1930's"

Meyer realised that a lot of people had actually been using the wavelet principles without knowing it themselves [11]. Possibly the biggest pioneer in the wavelet field was Jean Morlet, who was the first to use the term "wavelet". Around the year 1975, while working for an oil company, he realised that the techniques used for searching for underground oil could be improved. By sending impulses into the ground and analysing their echoes, it was possible to tell how thick a layer of underground oil was. Originally, Fourier analysis and especially the STFT were used for this process, but since these techniques were very time-consuming, Morlet began searching for another solution. Looking at the STFT, Morlet decided that keeping the window size fixed was the wrong approach. Instead, he changed the window size while keeping the function (the number of oscillations) fixed. This way, he discovered that stretching the window stretched the function and squeezing the window compressed the function. The foundation for wavelet theory was created, but Morlet was not satisfied yet.

In 1981 he teamed up with Alex Grossmann, and together they worked on an idea that Morlet had discovered while experimenting with a simple calculator: the transformation of a signal into wavelet form and back without losing any information, a lossless transformation between the time domain and the time-frequency domain. [17]

Other famous wavelet people include Ingrid Daubechies, who has created one of the most commonly used families of wavelets, the Daubechies wavelets [29], Stéphane Mallat, who collaborated with Yves Meyer, and Alfréd Haar, who worked with wavelet-like ideas as early as 1909 and after whom a wavelet family is named.

5.5 The wavelet theory

Wavelet analysis is based on the translation and dilation (scaling) of a so-called mother wavelet, ψ(t). The wavelet function can be described as in (5.13).

\psi_{s,\tau}(t) = \frac{1}{\sqrt{s}}\, \psi\!\left( \frac{t - \tau}{s} \right)   (5.13)

Here, s is the scale factor, ψ is the mother wavelet, τ is the translation factor, and the factor 1/√s is for energy normalisation across the scales. [32]

The procedure of a wavelet transform is straightforward: the wavelet, which is a scaled and translated version of the mother wavelet, is convolved with the signal. The scale values can be likened to inverted frequencies and range from small to large values. There are no restrictions on how many scales, or what spacing between them, can be used in the transform. A large scale value stretches the mother wavelet, which will therefore correlate best with the low-frequency content of the signal. In the same way, a small scale value results in a compressed wavelet function, making it well suited for the analysis of high frequency signals.

Figure 5.13: Differently scaled (s = 1, 2, 4) wavelets.
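Equation (5.13) is easy to express directly in Matlab. The sketch below (not from the thesis) uses an assumed real-valued "Mexican hat" shape as the mother wavelet, purely to illustrate how scaling stretches the function and translation moves it along the time axis.

% Minimal sketch: scaled and translated wavelets according to (5.13).
psi = @(t) (1 - t.^2) .* exp(-t.^2 / 2);                  % assumed mother wavelet

psi_st = @(t, s, tau) (1/sqrt(s)) * psi((t - tau) / s);   % equation (5.13)

t  = -10:0.01:10;
w1 = psi_st(t, 1, 0);     % scale 1
w2 = psi_st(t, 2, 0);     % scale 2: stretched, matches lower frequencies
w4 = psi_st(t, 4, 3);     % scale 4, translated by 3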
What this actually means is that, unlike the STFT, the wavelet transform uses differently sized analysis functions in order to maximise the exactness of the analysis, a phenomenon referred to as multi-resolution analysis. The advantage of this is that for the low-frequency content the frequency resolution is very high, and for the high frequency parts of the signal the time resolution is emphasised. This way, the wavelet analysis overcomes the traditional resolution problems of the STFT. Another advantage of wavelet analysis compared to traditional Fourier analysis is that since the mother wavelet can be defined in infinitely many ways, a wavelet can contain as many sharp corners and discontinuities as desired. Fourier analysis is entirely based on sinusoids, giving it less freedom and fewer possibilities.

5.5.1 Continuous Wavelet Transform (CWT)

Much like the STFT, the CWT is performed by convolving a signal with a function, in this case the wavelet declared in (5.13), and the transform is computed separately for different segments of the time-domain signal. The transform can be seen as a filtering, with the wavelet function being the filter. The big difference between the CWT and the STFT is that with wavelets, the width of the "window" (i.e. the wavelet function) changes with the scale value.

C(s, \tau) = \int x(t)\, \psi^{*}_{s,\tau}(t)\, dt   (5.14)

where * indicates complex conjugation and ψ is the wavelet function defined in equation (5.13). Thus, using a scaling function (theoretically) yielding infinitely many scale values, and translating the wavelet in time, is called the Continuous Wavelet Transform. The convolution is performed once for every scale value defined. This results in the two-dimensional matrix C, with one row for each scale and one column for each sample point of the signal. The contents are coefficients describing, for every scale at every time instant, how well the corresponding wavelet function matches the signal. By examining the coefficients, the best-fitting scale, and thereby the most likely frequency content, can be decided.

5.5.2 CWT in the frequency domain

The formula for performing the CWT in the frequency domain (5.15) is similar to the one for the time domain.

\hat{C}(s, \tau) = \int \hat{x}(\omega)\, \hat{\psi}^{*}_{s,\tau}(\omega)\, d\omega, \quad \text{where}   (5.15)

\hat{\psi}_{s,\tau}(\omega) = \sqrt{s}\, \hat{\psi}(s\omega)\, e^{-j\omega\tau}   (5.16)

is the Fourier transform of the wavelet function. As can be seen in equation (5.15), the frequency-based CWT does not shift the signal in time (the translation appears only as a phase factor in (5.16)) and the integrand is therefore a simple pointwise multiplication. Another difference lies in the rescaling of the mother wavelet, since a rescaling by s in the time domain becomes 1/s in the frequency domain. [10]

5.5.3 Visualisation

In order to visualise the CWT, a three-dimensional plot is an easy way to clarify the results of the analysis. By plotting time versus the scale values on the x- and y-axes respectively, and the coefficients of the transform on the z-axis, the results are visualised as "mountain peaks", as illustrated in Figure 5.14.

Figure 5.14: Coefficients of the CWT performed with the Daubechies 8 wavelet on a 440 Hz sine signal.

5.5.4 Discretisation of the CWT

The CWT is quite a demanding process, since the results cover the analysis of the signal using every defined scale for every time instant. Computing the true CWT on a computer is actually an impossibility, all computers being of a discrete nature. A discretisation of the CWT is necessary. The wavelet function and the scale values need to be sampled in order to be used in the transform for a discrete signal.
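A discretised, frequency-domain CWT along the lines of (5.15) and (5.16) could be sketched as below (an assumed illustration, not the thesis implementation). It presumes a Morlet-style complex mother wavelet given directly by its Fourier transform, here simply a Gaussian bandpass, cf. section 5.5.11; the test signal, the scale grid and the wavelet parameters w0 and b are all assumptions.

% Minimal sketch: one pointwise multiplication and one inverse FFT per scale.
fs    = 4000;                               % assumed sampling frequency
t     = 0:1/fs:1;
x     = sin(2*pi*440*t);                    % test signal, an A4
N     = length(x);
omega = 2*pi*(0:N-1)/N;                     % FFT frequency axis [rad/sample]
xhat  = fft(x);

w0 = 5; b = 1;                              % assumed centre frequency and bandwidth
psihat = @(w) exp(-(w - w0).^2 / b);        % wavelet spectrum, a Gaussian bandpass

scales = 2.^(0:0.25:6);                     % assumed scale grid
C = zeros(length(scales), N);
for i = 1:length(scales)
    s       = scales(i);
    filt    = sqrt(s) * conj(psihat(s * omega));   % scaled wavelet spectrum, cf. (5.16)
    C(i, :) = ifft(xhat .* filt);                  % coefficients for this scale
end
% abs(C) holds one row of coefficients per scale; the row with the largest
% values corresponds to the scale (and thereby frequency) of the test tone.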
5.5.5 More sparse discretisation of the CWT

In many practical applications, a uniformly sampled time-frequency plane will contain redundant information. To remove this, speed up the procedure and make the wavelet transform more manageable, a number of properties can be altered. The most important aspect is the frequency content. Since the scales correspond to the frequencies of the signal, it is not necessary to use the same sampling rate for every scale. High scales correspond to low frequencies, where the analysis does not rely on a high sampling rate. Therefore, by sampling differently for every scale, a lot of redundant information can be ignored.

5.5.6 Sub-band coding

There are a number of different ways of performing this operation. One popular way is to discretise the scales logarithmically. Using 2 as the base of the logarithm, only scales derived from the expression s = 2^k, where k is a positive integer, i.e. the values 2, 4, 8, 16, 32 and so on, are used. This discretisation, which is part of a technique called sub-band coding, also makes it possible to discretise the time axis. The scale changes by a factor of two, which is equivalent to a two times lower frequency. Consequently, the corresponding sampling rate for the time axis can be halved, according to the Nyquist criterion. The dyadic sampling grid (Figure 5.15) represents this method. As the scale factor increases, the frequency being analysed decreases, and a lower sampling rate is required.

Figure 5.15: The dyadic sampling grid. As the scale value increases, the number of necessary sampling points decreases.

5.5.7 Discrete Wavelet Transform (DWT)

The sub-band coding technique is the most common way of calculating the Discrete Wavelet Transform (DWT). By using low- and highpass filters, the signal is divided into two parts, each being examined with different scales at different frequencies. The collection of filters is often called a filter bank. The result of the sub-band coding is a number of coefficients, describing the high and low frequency content of the signal, according to the desired level of the analysis. Stepping up to a higher level means repeating this process for the lowpass filtered part of the signal.

Figure 5.16: The DWT principle. For every level, a highpass filtered signal describes the details, and the lowpass equivalent the approximation of the signal.

For every level of the analysis, the resulting coefficients describe the high frequency content (the details) and the low frequency content (the general approximation) of the signal. Depending on the intent of the transform, using a sufficient number of levels usually means using far fewer sample points than the discretised CWT, thanks to the dynamic sampling rate. The DWT is mostly used in image and audio compression methods, and is beyond the scope of this thesis.

5.5.8 Wavelet families

The possibility of using a unique function as mother wavelet is one of the most rewarding aspects of the wavelet concept. There are a great number of mother wavelets to choose from, each having its own characteristics and suitable areas of use. It is also fully possible to design new wavelets. A wavelet function can be complex or real, and often has an adjustable parameter for the localised oscillation.
The simplest mother wavelet is the Haar wavelet, which looks like a step function.

Figure 5.17: The Haar wavelet.

Another, more advanced wavelet function is the Daubechies family, where an integer denotes the number of vanishing moments. By adjusting this number n, the functions take on different shapes. This leads to the wavelets being called "Db n". For instance, a Daubechies wavelet with four vanishing moments is called a "Db 4".

Figure 5.18: The Daubechies wavelet family (top left the Db 2, top right the Db 4, bottom left the Db 8 and bottom right the Db 16).

5.5.9 Conditions for wavelets

Even though it is fully possible to design new mother wavelets, there are a number of conditions that need to be fulfilled.

The admissibility condition

For a continuous wavelet transform to be invertible, the mother wavelet must satisfy the admissibility condition (5.17).

\int \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\, d\omega < \infty   (5.17)

Here, ψ̂ is the FT of the mother wavelet ψ. This is only true when (5.18) is fulfilled.

|\hat{\psi}(\omega)|^2 \Big|_{\omega = 0} = 0   (5.18)

This means that the wavelet must have a bandpass-like spectrum. A zero of the FT at the zero frequency also means that the average value of the wavelet in the time domain must be zero (5.19).

\int \psi(t)\, dt = 0   (5.19)

It follows that the function is oscillatory; it must be a wave. Hence the name "wavelet". If the admissibility condition is fulfilled, the inverse wavelet transform can be defined as in (5.20).

x(t) = \iint C(s, \tau)\, \psi_{s,\tau}(t)\, d\tau\, ds   (5.20)

The signal x(t) is exactly reconstructed from the CWT coefficients, using the same wavelet function. Thus, the transformation is lossless.

The regularity conditions

The regularity conditions state that the wavelet function should have some smoothness and concentration in both the time and frequency domains, in order to make the wavelet transform coefficients decrease quickly with decreasing scale s. [32]

5.5.10 Wavelets and music

Two related wavelets that have been successfully used in applications related to music and sound are the Gabor and the Morlet wavelets [1]. These functions are complex and based on exponential functions, making them appropriate for analysing sinusoidal sound signals: when the function matches the signal, the resulting coefficient values are quite high, and when it does not match very well, the coefficient values are very low. This contrast makes it easy to distinguish the scales best suited to describe the analysed signal, in comparison with real-valued wavelets, where the peaks in the resulting coefficients are wider and harder to distinguish.

5.5.11 The Morlet wavelet

The Morlet wavelet can be defined as in (5.21).

\psi(t) = 2\, e^{-t^2/\alpha^2} \left( e^{j\pi t} - e^{-\pi^2 \alpha^2 / 4} \right)   (5.21)

α is a parameter controlling the bandwidth of the signal. In the frequency domain, the Morlet wavelet (5.22) is a complex bandpass filter; its effect as a filter would thus be to limit a signal to a band centred around a certain frequency.

\hat{\psi}(\omega) = \alpha\, e^{-\alpha^2 (\pi^2 + \omega^2)/4} \left( e^{\pi \alpha^2 \omega / 2} - 1 \right)   (5.22)

A modified version of the original complex Morlet wavelet is the Morlet pseudowavelet (5.23).

\psi_p(t) = \frac{1}{\sqrt{b\pi}}\, e^{-t^2/b + j\omega_0 t}   (5.23)

Here, ω0 is the centre frequency and b is the bandwidth. By altering the parameters for bandwidth and centre frequency, the wavelet can be designed to match normal sound signals very well. These Morlet wavelets are usually referred to as "Morlet ω0 - b". For instance, a Morlet wavelet with centre frequency 1 and bandwidth 5 is called "Morlet 1-5".
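The pseudowavelet of (5.23) is straightforward to express as an anonymous Matlab function (an assumed sketch, not the thesis code), here with the parameters of a "Morlet 1-5":

% Minimal sketch: the Morlet pseudowavelet of (5.23).
w0 = 1;  b = 5;                                           % "Morlet 1-5"
psi_p = @(t) (1/sqrt(b*pi)) * exp(-t.^2/b + 1j*w0*t);     % equation (5.23)

t = -10:0.01:10;
w = psi_p(t);
% real(w) and imag(w) give the oscillating real and imaginary parts,
% and abs(w) the Gaussian envelope controlled by the bandwidth b.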
Figure 5.19: Three different Morlet pseudowavelets (1-1, 1-5, 3-5). The two leftmost have the same centre frequency, but different bandwidths. The two rightmost have the same bandwidth, but different centre frequencies.

This version of the wavelet is not, strictly speaking, a theoretically perfect choice of mother wavelet, because it does not meet the admissibility condition; that is, it does not integrate to zero. The inverse transform is therefore not possible. However, by choosing a large enough centre frequency, the integral of the pseudowavelet can be made extremely close to zero, and the condition is thereby met in principle. Despite being a simplification, the pseudowavelet is in fact quite useful for time-frequency display of signals. Since the FT of the Morlet pseudowavelet, declared by (5.24), is much simpler to define than that of the original Morlet wavelet, it is also faster to use in frequency-based CWT computations. [10]

\hat{\psi}_p(\omega) = \frac{1}{\sqrt{b\pi}}\, e^{-(\omega - \omega_0)^2 / b}   (5.24)

5.6 Conclusion

Classic Fourier analysis is an important concept in the field of signal analysis. However, the fact that it does not include any time information makes it an inadequate method for this thesis. With wavelet analysis, the possibility of examining audio files and estimating frequencies in time presents a straightforward way of determining melodies, the principal aim of the analysis. From here, the information needs to be combined with the synthesis theories in order to fulfil the thesis goals.

6 Synthesis

6.1 Introduction

This chapter presents the different theories forming the synthesis part of the thesis. The major aspect here is the Markov chain and its features. Furthermore, some Artificial Intelligence aspects are discussed, and a summary of the MIDI file format is given. The combination of these ideas forms the synthesis method used.

6.2 Markov chains

Markov chains, and more generally Markov processes, are among the most fundamental objects in the study of probability. This chapter is influenced by lecture notes by Petterson [8].

6.2.1 Statistical model

Technically, a Markov process is a random process characterised by a lack of memory, depending only on the preceding state, a fact known as the Markovian property. Consider a collection of random variables { Xt } (with the index t running through 0, 1, ...) where, given the present, the future is conditionally independent of the past. In other words, a Markov process has the property of (6.1).

P(X_t = j \mid X_0 = i_0, X_1 = i_1, \ldots, X_{t-1} = i_{t-1}) = P(X_t = j \mid X_{t-1} = i_{t-1})   (6.1)

Markov processes are continuous in time, while Markov chains are time-discrete implementations. This means that a Markov chain implies a fixed-size time step; every transition happens after a certain length of time (6.2).

P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}}, \ldots, x_1 = a_{i_1}) = P(x_n = a_{i_n} \mid x_{n-1} = a_{i_{n-1}})   (6.2)

A Markov process has slightly different transition behaviour. The number of states can be the same as for the Markov chain, but in each state there are a number of possible events that can cause a transition, and these events take place at random points in time. This makes Markov processes continuous. A first order Markov chain only depends on the previous state, while a higher order chain depends on a larger number of preceding events.
This is the same as defining the Markov chain as first order, but with a different set of states (6.3).

P(X_t \mid X_1^{t-1}) = P(X_t \mid X_{t-n}, \ldots, X_{t-1})   (6.3)

An n-th order Markov chain over the alphabet A is equivalent to a first order chain over the alphabet of n-tuples, A^n. In general, an n-th order Markov process can be transformed into a first order Markov process by introducing a new random variable Y_t = \{ X_{t-n}, \ldots, X_{t-1} \}, yielding (6.4).

P(X_t \mid X_1^{t-1}) = P(X_t \mid Y_t)   (6.4)

For instance, consider the sequence MARKOV. A Markov chain of first order would consist of the states A = { M, A, R, K, O, V }. In this case, a second order chain would use the alphabet A² = { MA, AR, RK, KO, OV } but still behave like a first order chain.

To describe a so-called finite state first order Markov chain, three criteria need to be fulfilled:

• The system can be described by a set of finite states, and can be in one and only one state at a given time.
• The probability of a transition from state i to state j, Pij, is given for every possible combination of i and j.
• The transition probabilities are stationary over the time period of interest, and independent of how state i was reached. The initial state of the system, or the probability distribution of the initial state, is known.

Thus, in order to model a Markov chain, the initial state and the probability distributions of the different states need to be known. These probabilities can be arranged in a so-called transition matrix. The expression in (6.5) shows a probability transition matrix for a system of m states.

P = \begin{pmatrix} P_{11} & P_{12} & \cdots & P_{1m} \\ P_{21} & P_{22} & \cdots & P_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ P_{m1} & P_{m2} & \cdots & P_{mm} \end{pmatrix}   (6.5)

Pij represents the constant probability of a transition from state Xi at time t to state Xj at time (t+1). The Markovian property makes P time-invariant. All rows of the transition matrix sum to 1, and all values of Pij are greater than or equal to 0. In this matrix, all possible successive states and their probabilities are gathered. From this, calculating probabilities for a desired passage of states is a simple problem involving basic probability operations.

6.2.2 Markov chain example

Suppose that a system has the transition matrix of (6.6), where the rows and columns correspond to the states A, B and C, in that order.

T = \begin{pmatrix} 0.2 & 0.6 & 0.2 \\ 0.1 & 0.6 & 0.3 \\ 0.5 & 0.2 & 0.3 \end{pmatrix}   (6.6)

This implies that the probability of reaching state A from state A is 0.2, of reaching state B from state A is 0.6, and so on. Furthermore, assume the initial state is given by the vector (6.7).

S_0 = \begin{pmatrix} 1 & 0 & 0 \end{pmatrix}   (6.7)

Apparently, the starting state is A. Using the transition matrix, it is straightforward to calculate the state vector after j transitions, as given by (6.8).

S_j = S_0 T^j   (6.8)

For instance, after 2 transitions the state vector looks like (6.9).

S_2 = \begin{pmatrix} 0.20 & 0.52 & 0.28 \end{pmatrix}   (6.9)

This means that after 2 transitions, there is a probability of 0.20 that the current state is A, 0.52 for state B and 0.28 for state C. An interesting phenomenon occurs when comparing the state vectors after 6 and 10 transitions, respectively (6.10).

S_6 = \begin{pmatrix} 0.23 & 0.49 & 0.28 \end{pmatrix} = S_{10}   (6.10)

The two state vectors have exactly the same probabilities. The system has reached its so-called steady state distribution, which shows the constant probabilities of the different states, regardless of the starting position. This happens when the transition matrix T is regular (some power of T has all components greater than zero). A Markov chain represented by such a transition matrix is called a regular Markov chain.
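The example can be verified numerically in a few lines of Matlab (an illustration, not from the thesis), using the matrix power of (6.8):

% Minimal sketch: the transition matrix example above.
T  = [0.2 0.6 0.2;    % transitions from A
      0.1 0.6 0.3;    % transitions from B
      0.5 0.2 0.3];   % transitions from C
S0 = [1 0 0];         % the chain starts in state A

S2  = S0 * T^2        % -> [0.20 0.52 0.28], as in (6.9)
S6  = S0 * T^6        % -> approximately [0.23 0.49 0.28]
S10 = S0 * T^10       % -> the same values (rounded): the steady state of (6.10)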
6.3 Artificial Intelligence (AI)

The study of human intelligence is one of the oldest research fields. Over the last 2000 years, philosophers have tried to understand how learning, understanding and reasoning are, and should be, done. During the 1950s, the development of computers allowed testing of theories and experiments of a more complex nature than was previously possible, making the research area more practical and concrete. Computers were initially believed to have an unlimited potential for intelligence. But while computers allowed countless calculations to be performed in an instant, many of the theories believed to address the concept of intelligence failed. Combining intelligence and computers proved to be a major research area of its own.

6.3.1 The AI research field

Russell describes AI research as the quest for the solution to one of the ultimate puzzles: how is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict and manipulate a world far larger and more complicated than itself? Unlike other questions, where the answer may not even exist, the evidence for the existence of an answer to the AI quest is clear; just look in the mirror. To understand the diversity of the AI field, it is sufficient to look at the areas in which it appears: from general-purpose areas like perception and logical reasoning to specific tasks like playing chess, proving mathematical theorems, writing poetry and much more. However, to nail down the most important fields of study, four different aspects can be considered:

• Systems that think like humans.
• Systems that act like humans.
• Systems that think rationally.
• Systems that act rationally.

The keyword here is "rationality". In AI, this distinction is made because human behaviour contains irrational mistakes, a fact that has to be taken into consideration. Therefore, Russell states:

"A human-centred approach must be an empirical science, involving hypothesis and experimental confirmation. A rationalist approach involves a combination of mathematics and engineering." [31]

6.3.2 Simulating human behaviour

As far as the AI elements of this thesis go, the "mistakes" of human behaviour are of interest. In order to synthesise music that avoids having a cold and machine-like feel to it, it is necessary to actually recreate the small errors and random factors that human fabrications may possess. This way, the purpose could be defined as "a system that creates an output that appears to be made by a human", rather than any of the four cases above. The study and consequences of human error is a big research area of its own. Usually, this sort of study addresses computer training systems that allow the computer to learn from its own mistakes. Rauterberg discusses examples of this, which go beyond the scope of this thesis [9].

6.3.3 AI and music

Looking at the combination of music and AI, the goal of the research is to make computers behave like skilled musicians. This would mean the ability to perform specialised tasks like composition, analysis, improvisation and so on. As in many other AI areas, the emphasis so far has been put on these specialised tasks independently of each other. Therefore, current research is looking for a way to integrate the different tasks in a general application.
Since music is not a strict science, but rather the combined effect of both physical and emotional reactions, it is debatable whether the possibility of a superior music machine is desirable. Most traditional musicians keep trying to move music away from this type of automatism, while AI research tries to reduce the gap between computers and music. This conflict makes music and AI a very interesting combination. [27]

6.4 MIDI

MIDI is an abbreviation for Musical Instrument Digital Interface. To avoid a common misunderstanding, it is important to realise that MIDI is not a thing that can be owned, nor a thing that can be touched. It is the name of a communications protocol that allows electronic musical instruments to interact with each other. Another misunderstanding is that MIDI was designed to be used as a sound source for video games and the like. In reality, the MIDI protocol was created by musicians, for musicians, and with the needs of musicians in focus. [13]

6.4.1 MIDI history

The saga of MIDI has its origin back in the days when synthesisers began to gain recognition from the public as proper music instruments (read: the late 1970s and early 1980s). The breakthrough synthesiser artists had one major problem: it was hard to perform their music live on stage. In the studio, they could layer their electronic sounds on top of each other using multiple tracks, but like everybody else, they only had two hands, which limited the possibility of recreating the music live. To solve the problem, synthesiser technicians from various manufacturers met to discuss ideas. In 1983, their results were revealed at the first North American Music Manufacturers show in Los Angeles. The demonstration showed how two synthesisers, manufactured by different companies, could be connected with cables. One of the synthesisers was played, and both of them could be heard. In order to show the two-way nature of the communication, the process was then reversed in front of an impressed audience.

The MIDI principle is very reminiscent of the way two computers can communicate via modem, with the difference that here the "computers" are synthesisers. The information being shared is musical in nature, and in its most basic mode tells the synthesiser when to start and when to stop playing a certain note. Other information possible to share is the volume and the possible modulation of the note. MIDI information can also be more hardware specific; it can tell a synthesiser to change sounds, master volume, modulation devices and much more.

Soon it became clear that computers and MIDI would be an ideal combination, since they speak the same binary language. The only problem was the fact that MIDI used a data transmission rate of 31.25 kBaud, which differed from all computer data rates. To solve this, an interface was designed which allowed the computer to talk to MIDI devices. The first companies to establish themselves in the MIDI-computer market were Apple, Commodore and ATARI. Today, almost all types of computer systems have interfaces for the MIDI protocol. [19]

6.4.2 The MIDI commands

The very basis of MIDI communication is the byte, or rather the combination of bytes. Each MIDI command, or MIDI event, has a specific binary byte sequence, in this chapter expressed in terms of hexadecimals. Each byte is 8 bits long. The first byte is always the status byte, telling the MIDI device which function to perform.
Encoded within the status byte is the MIDI channel, ranging from 0 to 15. Thus, MIDI is a 16-channel interface, with the channels being completely independent of each other. Possible actions of the status byte are, for example, Note On, Note Off or Patch Change. Depending on which of these actions the status byte indicates, a number of different bytes will follow.

Naturally, the most important commands are the Note On and Note Off cases. If a Note On is sent, the MIDI device is told to begin playing a note. Two additional bytes are required: a pitch byte, deciding which note will be played, and a velocity byte, which sets the force of the pressed key. The velocity byte is not supported by all MIDI devices, but is still required to complete a Note On transmission. A Note Off indication uses the same structure as the Note On command. A Note On command could, for example, appear as in Table 6.1.

Binary code   Hexadecimal
10010000      90
00111100      3C
01110010      72

Table 6.1: MIDI Note On command.

Here, 90 indicates a Note On command for MIDI channel 0, 3C is the key pressed (translating into the decimal number 60, a C4) and 72 is the velocity with which the key was pressed (resulting in the force 114). MIDI supports 128 key numbers, meaning that the key variable ranges between 0 and 127, or in hexadecimal notation between 00 and 7F. The velocity variable has the same range. If a Note On command with a velocity of 0 is executed, it is actually interpreted as a Note Off. When the key is released, a Note Off command is sent to the MIDI device.

Binary code   Hexadecimal
10000000      80
00111100      3C
00100011      23

Table 6.2: MIDI Note Off command.

Just like Note On, it consists of three bytes. The first one (80) indicates a Note Off command for channel 0, the second one (3C) indicates which note is to be turned off, and the last one (23) sets an off-velocity.

A Patch Change command instructs the MIDI device which of its built-in sounds should be played. The General MIDI library is the standard instrument list. Using this standard, it does not matter on which synthesiser a tone is played; for instance, a tone played with patch number 0 will always sound like an Acoustic Grand Piano. The Patch Change command requires only one data byte: the number corresponding to the patch number on the synthesiser.

Binary code   Hexadecimal
11000000      C0
01001010      4A

Table 6.3: MIDI Patch Change command.

This command changes the patch on channel 0 to instrument number 4A (or 74 expressed in decimal notation), which in the General MIDI case would be a flute. In the General MIDI library, the instruments are divided into 16 different families. This means that within the patch numbers 1 to 8, the "Piano" family of instruments will always be found, and so on.

6.4.3 Standard MIDI file format

In order to use the MIDI commands, or events, as defined above, a MIDI file needs to have a certain structure. A standard MIDI file consists of different types of so-called chunks: a header chunk and an arbitrary number of track chunks. A track in a MIDI file can be thought of as the equivalent of a track on a multi-track tape deck; it may represent a voice or an instrument.

Header chunks

The header chunk appears at the beginning of the MIDI file, and describes the file format.

MIDI File header: [ 4D 54 68 64 ] [ 00 00 00 06 ] [ ff ff ] [ nn nn ] [ dd dd ]

The first four bytes translate into the ASCII letters "MThd", indicating the start of the MIDI file. The next four bytes represent the header length, which is always six bytes.
The [ ff ff ] information is the file format. There are three different formats of MIDI files: single-track, synchronous multiple-track and asynchronous multiple-track. Single-track means that there is only one track; synchronous tracks all start at the same time, which they do not in the asynchronous case. [ nn nn ] is the number of tracks in the file, and [ dd dd ] is the number of delta-time ticks per quarter note. Each event in a track is preceded by a delta-time, stating how many ticks after the previous event it should be executed. Delta-time is a variable-length-encoded value. This format allows large numbers to use as many bytes as they need, without requiring small numbers to waste bytes by padding with zeros. Some examples of numbers represented as variable-length quantities are given in Table 6.4.

Fixed size hexadecimal format   Variable-length format
00000000                        00
00000040                        40
0000007F                        7F
00000080                        81 00
00002000                        C0 00

Table 6.4: Variable-length encoding examples.

Track chunks

After the header chunk comes one or more track chunks. Each track chunk has a header, and may contain an arbitrary number of MIDI commands. The header for a track is similar to the file header.

MIDI Track header: [ 4D 54 72 6B ] [ xx xx xx xx ]

The first four bytes have the ASCII equivalent of "MTrk", indicating a MIDI track. The four bytes after this statement give the length of the track (excluding the track header), stating the number of bytes occupied by the following MIDI events. Each event is preceded by a delta-time. [26]

6.4.4 MIDI file example

To give a proper overview of a standard MIDI file, an example could look like Table 6.5.

MIDI Command                                          Hexadecimal notation
MIDI File Header                                      [ 4D 54 68 64 ]
Number of bytes in header (6)                         [ 00 00 00 06 ]
MIDI File Format (1)                                  [ 00 01 ]
Number of tracks (1)                                  [ 00 01 ]
Ticks per quarter note (96)                           [ 00 60 ]
MIDI Track Header                                     [ 4D 54 72 6B ]
Number of bytes in track (31)                         [ 00 00 00 1F ]
Patch change, program 2, channel 1, deltatime 0       [ 00 C0 02 ]
Note on, channel 1, C4, velocity 64, deltatime 0      [ 00 90 3C 64 ]
Note off, channel 1, C4, velocity 64, deltatime 30    [ 30 80 3C 64 ]
Note on, channel 1, E4, velocity 64, deltatime 0      [ 00 90 40 64 ]
Note off, channel 1, E4, velocity 64, deltatime 30    [ 30 80 40 64 ]
Note on, channel 1, G4, velocity 64, deltatime 0      [ 00 90 43 64 ]
Note off, channel 1, G4, velocity 64, deltatime 30    [ 30 80 43 64 ]
End of Track, deltatime 0                             [ 00 FF 2F 00 ]

Table 6.5: MIDI file example.

Since the C, E and G notes are played directly after one another, this example spells out a (broken) C-major chord. [6]
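As an illustration (an assumed sketch, not the thesis code), the bytes of Table 6.5 can be written to disk directly, producing a standard MIDI file that most sequencers and media players can open. The decimal values below correspond to the hexadecimal bytes in the table (for instance 3C = 60 for C4, 64 = 100 for the velocity and 30 = 48 ticks for the delta-time).

% Minimal sketch: writing the example of Table 6.5 to a .mid file.
header = [hex2dec(['4D';'54';'68';'64'])', 0 0 0 6, 0 1, 0 1, 0 96];

track  = [hex2dec(['4D';'54';'72';'6B'])', 0 0 0 31, ...
          0 hex2dec('C0') 2, ...                 % patch change to program 2
          0 hex2dec('90') 60 100, ...            % note on,  C4
          48 hex2dec('80') 60 100, ...           % note off, C4 after 48 ticks
          0 hex2dec('90') 64 100, ...            % note on,  E4
          48 hex2dec('80') 64 100, ...           % note off, E4
          0 hex2dec('90') 67 100, ...            % note on,  G4
          48 hex2dec('80') 67 100, ...           % note off, G4
          0 hex2dec('FF') hex2dec('2F') 0];      % end of track

fid = fopen('example.mid', 'w');
fwrite(fid, [header, track], 'uint8');
fclose(fid);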
6.5 Conclusion

The techniques explained in this chapter form the basis of the synthesis part. The Markov chain is used to generate new tone sequences, the AI considerations make the output appear more "human", and the MIDI specifications are used to make it listenable. The implementation of these ideas, along with the analysis theories described in the previous chapter, forms the application which accompanies this thesis.

7 Implementation

7.1 Introduction

This chapter explains in detail how the various theoretical principles were implemented to form a Matlab-based application.

7.2 Analysis

The idea of the analysis part of the application was to find the tone sequence of the input sound file. By doing this for several inputs, a database containing statistical information could be passed on to the synthesis.

Figure 7.1: Workflow of the analysis. The dashed grey window symbolises octave separation.

After studying the wavelet theory and looking at example applications, the focus was put on understanding and improving the Multiana program, and on trying to combine its ideas with the functions in Matlab's wavelet toolbox. Multiana performed a tone analysis, which was very appealing, but lacked speed and a clear structure as to what type of analysis was actually made.

7.2.1 CWT analysis

An example program from the official Matlab reference guide pages, also able to find and identify frequencies in time, was altered to perform the CWT with the scales being actual tones, rather than simple integer values. To implement this idea, a formula was used to define the required frequencies (i.e. tones) and then transform them into scale values for the mother wavelet, as given by (7.1). [22]

s = \frac{f_c \cdot f_s}{f}   (7.1)

Here, fc is the centre frequency of the given wavelet, fs is the sampling frequency and f is the frequency of the current note of interest. By doing so for every musical tone within five octaves, ranging from around 65 Hz to 2000 Hz, a CWT analysis could be performed with these adapted scale values.

Originally, test sound signals were created as pure sinusoids. The reason for this was that "clean" signals correlated better with the wavelet and resulted in higher coefficient values. As a result, the differences between the various wavelets could be spotted more easily. A pure sinusoidal A4 note can be defined as in (7.2).

A(t) = \sin(440 \cdot 2\pi \cdot t)   (7.2)

The built-in Matlab CWT function resulted in a large matrix with coefficients for the 60 different tones, showing how well the signal matched the wavelet function along the signal. A number of different types of mother wavelets and their characteristics were examined in order to find the most suitable one. Originally, the test programs used conventional mother wavelets, for instance the Daubechies wavelet. The results varied for the different wavelet types, and the CWT seemed very slow. Furthermore, the resulting coefficients did not impress, as shown by Figure 5.14. The algorithm did find the scales most likely to represent the tone of the signal, but the "top" of the resulting coefficients was not narrow enough to decide one unique scale (and tone). The problem of determining one tone from the coefficient matrix became obvious when plotting the tone (frequency) versus time representation, where the blacker the colour, the higher the coefficient value (Figure 7.2).

Figure 7.2: 2D time-frequency plot of CWT coefficients using the Daubechies 8 wavelet. The signal being analysed is an A4 (440 Hz).

The use of the Daubechies wavelet was clearly unsatisfying, and something presenting more distinct results for musical signals was needed. The Morlet wavelet's similarity to a theoretically perfect (sinusoidal) music signal made it very interesting for this type of analysis. Being one of Matlab's built-in mother wavelets, it could be used directly in the Matlab CWT algorithm, which for the moment saved a lot of programming work.

Figure 7.3: Coefficients of the CWT performed with the Morlet 1-5 wavelet on a 440 Hz sine signal.

As seen in Figure 7.3, the resulting coefficients were now improved, and indicated that the nature of a musical signal was best captured by a complex-valued wavelet. The plot of the coefficients from the CWT showed that the peak clearly gave the most appropriate scale value.
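The conversion in (7.1) can be illustrated with a short sketch (assumed values, not the thesis code); both the centre frequency fc and the list of tone frequencies are arbitrary assumptions here, and the test signal follows (7.2).

% Minimal sketch: tone frequencies to scale values using (7.1).
fs = 44100;                        % sampling frequency of the input file
fc = 0.8;                          % assumed centre frequency of the chosen wavelet
                                   % (cf. the wavelet toolbox function centfrq)
f_tones = 440 * 2.^((0:11)/12);    % A4 and the eleven semitones above it
scales  = fc * fs ./ f_tones;      % equation (7.1): one scale value per tone

% A pure sinusoidal A4 test signal as in (7.2):
t = 0:1/fs:1;
A = sin(440 * 2 * pi * t);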
Since the scale best suited to describe the signal was easy to determine from the coefficient matrix, the tone was also easy to identify directly from the scale value, using equation (7.1). Plotting these results gave a much better picture of the tonal content, as illustrated in Figure 7.4.

Figure 7.4: 2D time-frequency plot of CWT coefficients using a complex Morlet 1-5 wavelet. The signal being analysed is an A4 (440 Hz).

Knowing that, for larger signals, filtering through convolution is a much more demanding process than filtering through pointwise multiplication in the frequency domain, performing the wavelet transform in the time domain was obviously not the optimal choice. To substantially decrease the number of computations for the transform, working in the frequency domain was desired. At this point, using the Matlab wavelet toolbox was no longer an option, since it only included methods for transforming by convolution. Fortunately, there were a number of independent wavelet toolboxes for Matlab with the source code available on the Internet. One of them was YAWTB, which did in fact perform the CWT in the frequency domain [14]. However, the mother wavelets available in this toolbox were not nearly as good as the Morlet pseudowavelet for analysing musical signals. Still, the way the transform was performed was of great interest, and the principle in itself could be used. A couple of lines of code from YAWTB, combined with some wavelet theory and the definition of the Morlet pseudowavelet in the frequency domain [10], formed the foundation of the CWT function used in the final version of the application.

7.2.2 Improving performance

The application was now able to identify which tone a sinusoid test signal contained. However, to make the application able to analyse "real" music signals, the input files were hereafter selected as mono audio in the formats .wav or .au, since the Matlab environment fully supported reading these formats. The motivation for using mono was that the CWT could only analyse one signal at a time, while stereo sound consists of two channels (signals).

Performing the CWT calculations for a signal with hundreds of thousands of samples over several octaves (each octave having 12 scales) was extremely demanding, even for a fast computer with lots of memory. To improve the performance, some sort of skip variable needed to be implemented, making the program analyse the signal only at every skip-th element. But simply performing the calculations this way gave very poor and extremely oscillatory results. It was necessary to skip values in a more intelligent way, looking at the content of the signal before deciding the step length. The new approach was to downsample the signal as much as possible without losing important information. According to the Nyquist criterion, the sampling frequency of a signal cannot be lower than twice the highest included frequency if the contents of the original signal are to be fully described. Theoretically, this meant that if the highest frequency could be found, the CWT calculations could be performed on the signal downsampled to twice this frequency, still giving the same results. Since the tones played in the test files often appeared in the fourth or fifth octave, the sampling frequency needed was seldom higher than 2 kHz.
This was a major difference from the CD-quality sampling rate of 44.1 kHz used in most signals, meaning that they could be downsampled up to ten times, making the CWT calculations significantly faster. In practice, downsampling an audio signal that much produces unlistenable results, and it is usually recommended not to use a sampling frequency lower than five to ten times the highest frequency. However, in this case the signals were used only for calculations and were not reconstructed for listening purposes. The CWT actually produced satisfying results when the signal was downsampled to a sampling frequency of a mere 2.5 times the highest frequency. The gain of using a higher sampling frequency had to be weighed against the cost of longer computation times. To be able to address this trade-off, the multiplication factor of the highest frequency was introduced as a variable in the application.

7.2.3 Fourier spectra

To decide the optimal downsampling factor, the highest frequency of the input audio signal needed to be found. Thus, the FFT was performed on the signal, producing a vector of frequency magnitudes. The highest frequency was then the index of the last non-zero element in the vector (rescaled by sampling frequency and number of samples). Unfortunately, non-zero elements appeared corresponding to frequencies far above the highest frequency of any importance to the signal content. Only the strongest frequencies were of interest, since the focus was put on the fundamental tones of the melody rather than its harmonics. Therefore, a threshold was set, specifying a percentage of the highest magnitude in the FFT vector. A frequency had to have a corresponding magnitude higher than this threshold, or else it would not be accounted for. The thresholding process passed on entire octaves, meaning that the CWT was always performed on n · 12 tones, where n is the number of interesting octaves. The octave containing the frequency with the highest magnitude would always be analysed.

Figure 7.5: Thresholding using a factor of 50 %. The red window shows the interesting frequencies selected from the FFT spectrum. The entire octave(s) containing the interesting frequencies is always used, as seen in the right image.

This threshold could also be used to specify the depth at which the analysis would be performed, since it implicitly set the new sampling frequency. Using a low threshold, more frequencies were covered and the CWT found more harmonics, which increased the computation time. If only the fundamental tones were of interest, the threshold could be set to a high value, saving a lot of valuable time.
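A minimal sketch of this thresholding step (not the thesis code) could look as follows; it assumes that a mono signal x and its sampling frequency fs are already loaded, and uses a 50 % threshold as in Figure 7.5.

% Minimal sketch: highest "interesting" frequency from the FFT magnitudes.
N      = length(x);
mag    = abs(fft(x));
mag    = mag(1:floor(N/2));               % keep the positive frequencies only
freqs  = (0:floor(N/2)-1) * fs / N;       % rescale bin indices to Hz

thresh = 0.5 * max(mag);                  % 50 % threshold, as in Figure 7.5
idx    = find(mag >= thresh);
f_high = freqs(idx(end));                 % highest frequency above the threshold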
7.2.4 Normalisation and compressor usage

Obviously, a real-life music signal was not as pure as a computer-generated sine tone, and caused the CWT to produce less clinically perfect results. To rectify this, consideration needed to be given to the range of amplitudes in the input signal. A natural way to address this would be to normalise the signal, so that its maximum amplitude is set to one, just as in a pure sine signal. However, the internal relationship between the audio levels in the signal also posed problems: the louder tones could cause the CWT to miss the weaker ones. To some extent, reducing the amplitude differences in the signal using a compressor solved this. The compressor found amplitudes higher than a certain threshold value and set all these amplitude values to the threshold, which in effect made the compressor a limiter. This was followed by another normalisation, which stretched the amplitude range and made the signal easier to analyse.

Figure 7.6: Uncompressed signal (left) and the same signal compressed with a threshold of 0.3 and then normalised (right).

7.2.5 Downsampling

From the highest frequency of interest, the signal was downsampled correspondingly. The downsampling factor d can be expressed as given by (7.3).

d = \frac{f_s}{k_N \cdot f_h}   (7.3)

Here, fs is the original sampling rate of the signal, kN is the multiplication factor for the Nyquist criterion and fh is the highest frequency found in the FFT spectrum. Thus, the downsampling meant that every d-th amplitude value was kept. This procedure introduced a new and disturbing problem: the CWT now found harmonics even in a pure sine signal, which in theory should not contain any harmonics at all. This was due to aliasing, and the signals had to be lowpass filtered before the downsampling. By doing so at half the desired sampling frequency, the aliasing artefacts disappeared.

7.2.6 Octave-wise analysis

After the adaptive downsampling of the signal, the CWT could be performed faster, and the program was now a completely functional tone detection implementation. Next, the plan was to try to separate different kinds of instruments from the signal and analyse them individually. Investigating the frequency ranges of a number of typical "pop music" instruments (bass guitar, guitar, piano and drums) led to the conclusion that their frequency ranges in many cases overlap. An efficient and intelligent identification of the instruments from the CWT analysis would therefore be hard to implement, which led to these plans being abandoned. Instrument identification was not a feature necessary for the basic concept of the thesis, but rather something to implement in a more sophisticated version of the application.

Still, the idea of separating the signal's frequency spectrum into smaller parts was not abandoned completely. Analysing one octave at a time allowed much more optimisation of the calculations, due to a number of possibilities:

• To use fewer scales. Since one octave consists of twelve tones, no CWT calculations were performed with more than 12 scales. This resulted in faster computations and smaller amounts of data to be handled simultaneously. Even though the total number of analysed octaves might not be less than before, separating them eased the computational stress on the CPU.

• To downsample according to the highest frequency in the particular octave. By keeping track of the current octave being analysed, the highest frequency included in this octave was always known. Thus, an optimal downsampling rate could easily be decided for each analysis, meaning that the lower octaves could be downsampled to a much higher extent. More heavily downsampled signals gave the CWT analysis less data to analyse, which decreased the computation time.

• To avoid performing any calculations for octaves not containing any strong frequencies. From the FFT analysis, the lowest and highest octaves containing strong frequencies were found. By examining every octave in between, looking for frequencies strong enough to be accounted for, the application decided whether or not the particular octave should be CWT analysed. No unnecessary computations were then performed.
Using these ideas, the signal was analysed with twelve scales for each octave, where the interesting octaves were selected from the FFT in the thresholding process. The frequencies of the tones were known and could be used to define the corresponding scale values. Based on the highest frequency, an optimal sampling frequency was set, and each octave could then be downsampled as much as possible.

d_j = \frac{f_s}{k_N \cdot f_{h_j}}   (7.4)

The downsampling factor dj was decided for each octave j prior to the CWT analysis, where fhj is the highest frequency of octave j. The lowpass filtering was then performed for each octave, removing the frequency content above fhj. The octave separation is illustrated in Figure 7.7, and is also symbolised by the grey dashed window in Figure 7.1.

Figure 7.7: Adaptive downsampling and interpolation of the CWT results.

The CWT analysis resulted in one matrix for each octave, with coefficients for the different tones. Since each octave was downsampled differently, the number of columns in the matrices was not the same. Therefore, the matrices needed to be "smeared" in order to be assembled back into one big matrix. This was done by making all matrices the same width as the least downsampled one (the octave containing the highest interesting frequency) and interpolating the coefficients. By doing so, the results for each octave could be piled on top of each other, making the matrix look as if it had been calculated using one single CWT analysis, but much faster. The procedure is reminiscent of the DWT's sub-band coding technique, with the difference that the different levels were smeared to the same size and put back together into one matrix. The realisation of the octave separation sped up the CWT part of the application substantially, and it also increased the capacity to handle minutes' worth of high quality data rather than just seconds, as was the case prior to the optimisation.

7.2.7 Binary threshold

The CWT analysis proved itself able to accurately find the tones played in a musical piece. It was also possible to affect the depth of the analysis by adjusting the threshold selecting interesting frequencies from the FFT spectrum of the input signal. This way, finding harmonics was not a problem for the application, but could result in a slower analysis. However, the harmonics were not the primary interest, but rather the sequence of the melody's fundamental tones. These resulted in the highest coefficient values of the CWT matrix. To clean up this matrix and get rid of weaker matches, for instance harmonics, the matrix was thresholded into a binary version, setting all values over a certain value to one and the rest to zero. This way, only the strongest tones were left in the matrix.
Figure 7.8: Resulting CWT matrix (left) and binary equivalent, using a threshold of 0.2 (right).

7.2.8 Holefilling

There were problems associated with the binary thresholding. Due to the oscillatory behaviour of the coefficients, tones could be split up; if the value oscillated below the threshold, "holes" appeared in certain tones. This eventually led to the application misunderstanding the melody sequence, since a single tone could be interpreted as two individual tones. Another problem came from the fact that the CWT results sometimes showed very small, wrongly estimated peaks at neighbours of the correctly identified tones.

Figure 7.9: "Holes" and "peaks" from the CWT analysis (left) cleaned up, using an e-value of 0.1 (right).

Looking at the nature of these artefacts, it was easy to see that they were actually of the same type. In the hole case, the matrix needed to be filled out with a certain amount of ones, and in the peak case, ones needed to be removed. This was performed using yet another variable e, specifying a percentage of the sampling rate of the result matrix. By looping through the binary matrix for each tone and checking whether the next entry falls within the e interval, possible holes in the analysis were found. If the next entry was outside the interval, it was regarded as a new event. In the same loop, an event was required to be longer than e points to be saved at all. After performing this thresholding with a suitable value of e, the peak and hole problems were solved. What remained was a matrix containing only the most important tones, as illustrated in Figure 7.9.

7.2.9 Event matrix

To simplify the melody identification and the MIDI writing of the analysis results, the binary matrix was transformed into another matrix. The purpose of this was to view the results as events: what tone was played, and when did it start and end. Every position of the binary matrix containing information (every tone content found in the analysis) was translated into a row in a new three-column matrix. That is, every piece of tone information found, at all time instants, was saved. An illustration of this is given by Figure 7.10, and a code sketch of the cleanup and event matrix construction follows after Figure 7.11.

Figure 7.10: Binary matrix transformed into an event matrix. The tones are selected in time-wise order, and every tone transforms into two rows of the event matrix.

With this convenient way of saving the information, it was easy to sort all events by their start or stop time. This enabled the definition of each tone played, expressed in tone numbers as used in MIDI notation, and the exact start and stop times of every event. From this, MIDI files could be written more easily than by looking at the original binary matrix.

7.2.10 Storing the results

With the event matrix, finding the played tone sequence was a trivial problem. By simply looking at the start times of all notes, the order in which they were played was saved as a text string. Every saved tone was represented by two characters: the first indicating the semitone, and the second the corresponding octave number. To distinguish a tone from its raised equivalent (i.e. a G from a G#) without using two symbols, the raised ones were denoted with uppercase letters. This resulted in the following tone range:

[ cCdDefFgGaAb ]

A typical tone sequence can look like Figure 7.11.

Figure 7.11: Analysed tone sequence.
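The hole filling and short-event removal of section 7.2.8, and the event matrix of section 7.2.9, can be sketched for a single tone row as below. The toy row, the tolerance and the note number are made-up values, and the exact loop structure of the thesis may differ.

% Hypothetical sketch of hole filling, peak removal and event matrix
% construction for one row (one tone) of the binary matrix.
row = [0 1 1 0 1 1 1 1 0 0 0 0 1 0 0 1 1 1 1 1];   % toy binary row
e   = round(0.1 * numel(row));                     % tolerance in matrix samples

% Fill holes no longer than e samples.
idx = find(row);
for k = 1:numel(idx)-1
    if idx(k+1) - idx(k) <= e
        row(idx(k):idx(k+1)) = 1;                  % close the small gap
    end
end

% Find events and discard those not longer than e samples ("peaks").
edges  = diff([0 row 0]);
starts = find(edges == 1);
stops  = find(edges == -1) - 1;
keep   = (stops - starts + 1) > e;
starts = starts(keep);
stops  = stops(keep);

% Event matrix rows: [time, start/stop flag, MIDI note number].
note   = 69;                                       % assumed MIDI note number
events = sortrows([starts(:) ones(numel(starts),1)  note*ones(numel(starts),1);
                   stops(:)  zeros(numel(stops),1)  note*ones(numel(stops),1)], 1);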
Figure 7.11 turned into the desired letter representation gives the following result:

[ f4g4a4f4g4a4g4g4g4g4f4f4f4a4g4f4a4g4f4a4g4A4a4g4f4f4g4a4f4g4a4g4g4g4g4f4f4 ]

For every analysed sound file, the resulting tone sequence was saved in a text file, which formed the "database" of the application. Furthermore, the length of every event was read from the event matrix and saved in the database file, for later use in the synthesis part. Having an external file as a database made it possible to open the application with any valid file and continue building upon it, rather than starting over with an empty database every time. In addition, the synthesis part of the application could be started with any previously saved database, without having to do a new analysis.

7.3 Synthesis

The idea of the synthesis was to examine all the tone sequences in the database created in the analysis part, and from them create a Markov model. This model, based on a singly linked list representation, was then used to generate an output built on the statistics from all input, in the form of a Markov chain implementation.

Figure 7.12: Workflow of the synthesis. The rounded boxes symbolise properties that can be altered.

7.3.1 Markov model

The linked list model required an object oriented Matlab programming approach, which was a new and exciting experience. Since Matlab offered no possibility of creating pointers to objects, a specific class was borrowed from the Data Structures & Algorithms Toolbox for Matlab [15]. At the very basis of the Markov model was an instance of a linked list, called prefixObject.list. To this list, different so-called prefix objects were added. The prefix objects consisted of two things: a prefix, which was a string sequence of tones of a certain length p, and a chain.

Figure 7.13: The list containing prefixObjects.

The chain object pointed to another list object, link.list. It also had a variable called total, an integer noting the number of links in the link list attached to it. The link objects had a variable called chr, which was the string equivalent of the found tone. They also had a count variable, which was the total number of occurrences of this given tone.

Figure 7.14: Chain object with associated list.

Assembling the different objects gave the model the appearance of Figure 7.15.

Figure 7.15: Full Markov model.

Object name          Variables
prefixObject.list    prefixObjects
prefixObject         prefix, chain
chain                link.list, total
link.list            Link
link                 chr, count

Table 7.1: Object table for Markov model.

The principle of the input analysis was to slide a window of a predefined size over all tone sequences in the database file. The size of the window corresponded to the size of the prefixes of the model. For every prefix inside the window, a prefix object was created and the following tone was stored in link.list within the chain object. This way, every analysed prefix was stored, together with all its possible following tones. If a prefix that had already been stored was found again, its next tone was added to the same chain. If the tone already existed in the chain, the count variable was increased by one; if the tone was new, it was stored as a new link. By repeating this procedure for all inputs, the result was a Markov model consisting of all prefixes of length p. Associated with each prefix were the possible successive tones: none, one or several.
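In a Matlab version with reference semantics available, the same model can be sketched with nested containers.Map objects instead of the borrowed linked-list classes. The function name and the two-characters-per-tone indexing are illustrative assumptions, not the thesis implementation.

% Hypothetical sketch of the prefix/chain/link model using containers.Map
% (saved as buildModel.m). seq is a tone string where every tone is two
% characters ("f4", "A4", ...), and p is the prefix length counted in tones.
function model = buildModel(seq, p)
    model = containers.Map('KeyType', 'char', 'ValueType', 'any');
    n = numel(seq) / 2;                         % number of tones in the string
    for i = 1:(n - p)
        prefix = seq(2*i-1 : 2*(i+p)-2);        % p consecutive tones (the "prefix object")
        next   = seq(2*(i+p)-1 : 2*(i+p));      % the following tone (a "link")
        if ~isKey(model, prefix)
            model(prefix) = containers.Map('KeyType', 'char', 'ValueType', 'double');
        end
        chain = model(prefix);                  % the "chain" holding successor counts
        if isKey(chain, next)
            chain(next) = chain(next) + 1;      % existing link: increase its count
        else
            chain(next) = 1;                    % new link with count one
        end
    end
end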
7.3.2 Prefix length

The most important factor during the analysis of the tone sequences was the prefix length. By being able to change the prefix length, the order of the Markov chain could be altered dynamically. In practice, this was implemented using a variable sized queue, with p denoting how many tones are stored simultaneously in the queue. If p was changed, the input was re-analysed with the new prefix length. The idea was that by increasing p, and thereby increasing the order of the Markov chain, the output would have more in common with the input sequences, and vice versa. A high value of p resulted in the application using longer patterns of tones as prefixes. A low value, on the other hand, meant that the application only used very short patterns of tones as prefixes, resulting in a more randomly created output.

7.3.3 Tone sequence analysis example

An example of a tone sequence analysis using a prefix length of four (p = 4) is given below. The tone sequence in this example is the typical output of the analysis described in chapter 7.2.

[ f4g4a4f4g4a4g4g4g4g4f4f4f4a4g4f4a4g4f4a4g4A4a4g4f4f4g4a4f4g4a4g4g4g4g4f4f4 ]

The first step of the statistical analysis would be to store the first four tones in the queue:

Queue: [ f4g4a4f4 ] Successor: [ g4 ]

A prefix length of four means that the queue contains four tones at a time, and always stores the following tone as a possible successor to the current prefix in the queue. In this case, the prefix [ f4g4a4f4 ] will have the tone [ g4 ] stored as a possible successor, as shown in Figure 7.16.

Figure 7.16: Prefix and followers added to the Markov model.

Next, the [ g4 ] tone is put in the queue, meaning that the first tone is thrown out:

Queue: [ g4a4f4g4 ] Successor: [ a4 ]

With the current queue now being [ g4a4f4g4 ], the successor [ a4 ] is stored for this prefix. The procedure is repeated for the entire sequence.

Figure 7.17: All queued prefixes are added to the Markov model.

Passing the entire tone sequence through the queue gives the model the relationship between prefixes and their following tones shown in Table 7.2.

Prefix²          Following tone   Occurrences
[ g4f4f4f4 ]     [ a4 ]           1
[ f4f4f4a4 ]     [ g4 ]           1
[ f4f4a4g4 ]     [ f4 ]           1
[ f4a4g4f4 ]     [ a4 ]           2
[ a4g4f4a4 ]     [ g4 ]           2
[ g4f4a4g4 ]     [ f4 ]           1
[ g4f4a4g4 ]     [ A4 ]           1
[ f4a4g4A4 ]     [ a4 ]           1
[ a4g4A4a4 ]     [ g4 ]           1
[ g4A4a4g4 ]     [ f4 ]           1
[ A4a4g4f4 ]     [ f4 ]           1
[ a4g4f4f4 ]     [ g4 ]           1
[ g4f4f4g4 ]     [ a4 ]           1
[ f4f4g4a4 ]     [ f4 ]           1
[ f4g4a4f4 ]     [ g4 ]           2
[ g4a4f4g4 ]     [ a4 ]           2
[ a4f4g4a4 ]     [ g4 ]           2
[ f4g4a4g4 ]     [ g4 ]           2
[ g4a4g4g4 ]     [ g4 ]           2
[ a4g4g4g4 ]     [ g4 ]           2
[ g4g4g4g4 ]     [ f4 ]           2
[ g4g4g4f4 ]     [ f4 ]           2
[ g4g4f4f4 ]     [ f4 ]           1

Table 7.2: Prefixes and followers of the tone analysis example.

Only the [ g4f4a4g4 ] prefix has two different tones as possible successors, with the analysis finding an [ f4 ] on one occasion and an [ A4 ] on another. As seen from the number of occurrences, some of the other prefixes have also appeared more than once, but always with the same following tone. From this information, the statistics are fairly obvious: for all prefixes except [ g4f4a4g4 ], the following tone is 100 % certain. For this prefix, there is a 50 % chance that the next tone is an [ f4 ] and a 50 % chance of it being an [ A4 ]. Building a table like this for all input creates an equivalent of the transition probability matrix normally used in Markov chain applications.
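Running the example sequence through the buildModel sketch above reproduces Table 7.2; in particular, the prefix [ g4f4a4g4 ] ends up with two stored successors. The expected values in the comments follow from the table, assuming the sketch is used as defined earlier.

% Hypothetical usage of the buildModel sketch with the example sequence (p = 4).
seq = ['f4g4a4f4g4a4g4g4g4g4f4f4f4a4g4f4a4g4' ...
       'f4a4g4A4a4g4f4f4g4a4f4g4a4g4g4g4g4f4f4'];
model = buildModel(seq, 4);

chain = model('g4f4a4g4');
keys(chain)      % expected: the two successors 'A4' and 'f4'
values(chain)    % expected: a count of 1 for each of them

chain = model('f4a4g4f4');
keys(chain)      % expected: the single successor 'a4'
values(chain)    % expected: a count of 2 (the prefix occurs twice)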
7.3.4 Creation of new tone sequences With the prefix model used as the probability source, a Markov chain implementation was used to generate the output. Again, this was performed using a queue system. If the analysis was performed with a prefix length of p, the Markov chain would use the same prefix length. This meant that the new queue was of the same size. 2 Note that the prefixes are not in the same order as the input is processed. 52 Implementation Björkvald, Svensson 2004 The principle is simple; by randomly picking a new tone from the associated links to the current prefix and putting it in the queue, a new prefix is created, with new possible successors. Using the example in 7.3.3, the procedure looks like this: Queue: [ g4f4a4g4 ] Possible successors: [ f4 ], [ A4 ] A random number is created, which is used to step through the link list a certain number of times. If the [ f4 ] is selected, it is inserted into the queue: Queue: [ f4a4g4f4 ] Possible successor: [ a4 ] The queue content is now a new four-tone prefix. In order to find the next successor, the application finds the prefix object associated with this prefix, and looks at the possible following tones. In this case, the prefix [ f4a4g4f4 ] always leads to next tone being an [ a4 ]. To guarantee that the output will be longer than one prefix, the starting prefix was always randomly selected from all prefixes in the model having any possible successors. There were two possible cases of ending the synthesis; when a prefix with no successive tones was currently in the queue, or when the output length reached a predefined maximum length. This way, the user was able to control the length of the application output. If the result was satisfying for the user, it was possible to save the tone sequence as a MIDI file. If not, by shuffling the output, a new sequence was created. Every generated tone sequence ended with the same tone as it began with. This was a feature inherited from most existing real life music, and it made the melodies feel like they had a proper ending. Using this simple procedure, the model was used to create the new output tone sequence. Since it contained information for all input, using it to synthesise new sequences was a statistically safe way of combining all the input characteristics into one unique output. The synthesis was purely based on what was found in the analysis, and all generated combinations of tones are guaranteed to have been found somewhere in the input. If a certain tone combination was common, it was more likely to appear in the output. 7.3.5 Controlling the characteristics of the output At the same time as the tone sequences were read from the database, all stored tone lengths were read into a vector. When a new tone sequence was synthesised and written to a MIDI file, these lengths were used for deciding the length of each tone through random selection from the values in the vector. A shorter tone than the shortest value could not be created, and likewise, the highest value was not possible to override. But just randomly selecting values for the tone lengths produced highly unrhythmical results, sounding much like a small child tinkling away on the piano (or any other chosen MIDI instrument). And like the piano playing of small children, this hardly had any musical value at all to anyone but the parents (or in this case; the programmers). In order for the output to be more listenable, there was clearly a need to take control of the tempo for the melody being played. 
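The generation procedure of section 7.3.4 can be sketched as a loop that repeatedly draws a successor with probability proportional to its count and slides the queue one tone forward, using the buildModel map from above. The stop conditions mirror the two described in that section (a prefix without stored successors, or a maximum output length); the function and variable names are assumptions rather than the thesis code.

% Hypothetical sketch of the Markov chain synthesis (saved as generateSequence.m).
function out = generateSequence(model, maxLen)
    prefixes = keys(model);                         % in this sketch, every stored prefix has successors
    queue    = prefixes{randi(numel(prefixes))};    % random starting prefix
    out      = queue;
    for k = 1:maxLen
        if ~isKey(model, queue)                     % queue content with no stored successors: stop
            break;
        end
        chain  = model(queue);
        tones  = keys(chain);
        counts = cell2mat(values(chain));
        cum    = cumsum(counts);
        pick   = find(rand * cum(end) <= cum, 1, 'first');   % count-weighted random choice
        next   = tones{pick};
        out    = [out next];                        % append the chosen tone to the output
        queue  = [queue(3:end) next];               % slide the queue one two-character tone forward
    end
end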
Introducing a variable for tone length variation, it was possible to specify how much the tone lengths were allowed to differ from the mean value of the length vector. All values that fell outside the specified range were trimmed. Naturally, it followed that the smaller the variation, the more alike the tone lengths were. However, a zero percent variation created results that were too perfect even for a skilled pianist, and at least a few percent of tone length variation was needed.

Setting a small tone length variation introduced an even tempo throughout the newly synthesised "song", based on the mean value of all lengths from the input. But what if a tempo change was desired? A logical progression from being able to control the amount of variation was to introduce another variable, which displaced the value the variation was centred around. This could be set to any value within the range of the length vector. As a result, changing the length of the tones implicitly set the tempo. If the real tempo (in beats per minute) was of any interest, it could easily be approximated by dividing 60 by the centre value. To make the output sound "rhythmically" correct, a quantisation of the length vector was performed. The centre value was divided by two as many times as possible without becoming smaller than the shortest time value allowed. The resulting value was used as the quantisation factor, making all other values multiples of it. This way, all tone lengths were related, which made them sound better together. The quantisation factor was also used for deciding the spacing between tones. This introduced a more dynamic sound, as the melody became less mechanical.

7.4 MIDI representation

Since the analysis part offered the possibility of a wav2midi conversion of the analysed input, and the synthesis created new tone sequences based on the same structure, both cases required a writer that transformed the string sequences into actual playable sound. This was performed by using the event matrix and transforming it into a MIDI file.

7.4.1 Writing the MIDI format

The event matrix was of the form defined by (7.6).

[ time   start   note
   ...    ...    ...
  time   stop    note ]   (7.6)

Here, time is the sample value where the event occurs, start/stop is a one or a zero, respectively, and note is a MIDI note number between 0 and 127. Every note that had a start event also had to have a stop event, resulting in twice as many rows in the event matrix as there were tones. For this thesis, the method chosen for playing the new synthesised music was of far less importance than the synthesis itself. For simplicity, MIDI was chosen because it is a well-known standard and there was plenty of information about it to be found on the Internet. A useful example of the latter was Mosley's thesis, which provided a complete MIDI writing routine for Matlab [6]. This MIDI writer was somewhat modified for use with the application described. All times were recalculated from sample values to MIDI ticks using formula (7.7).

t_MIDI = (t_s / f_s) · (bpm / 60) · ppqn (7.7)

Here, t_MIDI is the time in MIDI ticks, t_s is the time as a sample value, f_s is the sampling frequency, bpm is the tempo in beats per minute and ppqn is the resolution of the MIDI file in ticks per quarter note. By dividing the sample value by the sampling frequency, time was converted into seconds and all dependency on the sample rate was removed.
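As a quick numerical check of (7.7), with illustrative values for the tempo and the MIDI resolution (not values taken from the thesis):

% Hypothetical example of converting a sample index to MIDI ticks with (7.7).
fs    = 44100;                           % sampling frequency [Hz]
bpm   = 120;                             % assumed tempo in beats per minute
ppqn  = 96;                              % assumed resolution in ticks per quarter note
ts    = 66150;                           % event time as a sample index (1.5 s into the file)
tMIDI = (ts / fs) * (bpm / 60) * ppqn;   % 1.5 s * 2 beats/s * 96 ticks/beat = 288 ticks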
54 Implementation Björkvald, Svensson 2004 These MIDI tick values were then transformed yet another time, into relative time values, deltatime. Delta-time was the time from the last occurred event to the current one, and saved a lot of data since the time between two events rarely was large compared to the absolute time value of an event. When writing the delta-times to a (binary) MIDI file, even more data was saved by utilising variable length format, meaning that only the necessary number of bytes (in principle, never more than two) were used. In order for the notes being played to sound more “human” and less mechanic, the attack velocity of each start event was set to a random value between 100 and 127. This was an AI approach simulating the trivial fact that a human being is not in any way an exact machine. Making the flaws and mistakes of regular persons visible in an application made it trustworthier – people in general tend not to rely on things that are “too perfect”. 7.5 Application process flow The flow of the application process is visualised by Figure 7.18. Figure 7.18: Workflow of the entire application. 55 Implementation Björkvald, Svensson 2004 7.6 Conclusion The analysis was performed using a frequency domain based CWT algorithm, which downsampled the signal differently for different octaves, analysed and stored the result in a coefficient matrix. This matrix was then transformed into an event matrix, storing the different tones in time order, which resulted in the exact tone sequence found in the sound file. By saving this sequence in a text file, along with every found tone length, the analysis was completed, and the file was passed on to the synthesis part. The purpose of the synthesis part of the thesis was to generate new tone sequences by analysing the ones found in the analysis part. Through defining a so-called prefix length, the database file containing all input tone sequences was analysed. All existing prefixes and their possible successors were stored in a Markov model. The statistics for all input was collected in the same model. A Markov chain implementation then used this model and its associated probabilities and created an entirely unique output purely based on the input tone sequences. By transforming the output into a MIDI file, it could be listened to. It was also possible to manipulate the output with some parameters affecting tempo, tone length and variation. 56 Results Björkvald, Svensson 2004 8 Results The final result of the thesis work was an application that analysed audio files containing music, stored the tones of the melodies played and statistically synthesised new material based on one or several inputs. All kinds of output of the application could be represented in the MIDI format, whether it was an interpretation of the input (i.e. a “wav2midi” conversion) or a synthesised unique melody. A GUI was implemented, in order for easier manipulation of a number of variables and thereby the data. 8.1 Analysis To focus the analysis on only the most interesting frequencies, or tones, an examination of the frequency content of the input signal was performed. From this, the relevant frequencies were selected, and passed on octave-wise to a continuous wavelet transform (CWT) analysis. The scales of the CWT were selected to correspond with the tones of the current octave, resulting in correlation coefficients for every tone at every sample point of the signal. 
Prior to the CWT, the signal was downsampled as much as possible without losing information for the particular octave. By doing so, a lot of computational effort could be avoided. The data was processed in some ways for easier use and clearer results. Up to this point, the application was more or less a straight up “wav2midi” converter. For synthesis purposes, all found tone sequences were then saved in a database text file. 8.2 Synthesis The database file was run through a window of a certain size. For every string of tones in the window, the following tone was stored, resulting in a statistical model of all window-sized strings and their successors. This model was used as source for a Markov chain implementation of any chosen order, producing new tone sequences. The output was purely based on the database, making it unique but still with structural similarities to the input. By modifying certain parameters, the characteristics of the synthesised music could be altered. 8.3 Application screenshots Screenshots of the three different steps of the analysis and the Markov application can be seen in Figure 8.1Figures 8.1 through 8.4. 57 Results Björkvald, Svensson 2004 Figure 8.1: Step 1 of the application. Figure 8.2: Step 2 of the application. 58 Results Björkvald, Svensson 2004 Figure 8.3: Step 3 of the application. Figure 8.4: Markov application. 59 Conclusions Björkvald, Svensson 2004 9 Conclusions As a measure of the success of the thesis work, the objectives can be recapitulated: “The purpose of this thesis is to generate new music from existing sound material by using frequency analysis and the statistical properties of the analysed information. (…) What this means is that the new music will in fact be based on the characteristics of all the different input music, but will still be something completely unique.” Comparing the finished application with the declared objectives, it is apparent that they are fulfilled. Although some limitations have been applied, the final result is new music, based on the analysis of all input. However, due to the thesis nature, limitations may not always be synonymous with failures. Problems in some areas were expected from the beginning, and all insight gained is actually a part of the work. One aspect, which may not have been as predicted, is the distinct separation of the thesis into two different parts, showing themselves hard to combine. The analysis part proved itself the major one, consuming most of the project time. In fact, it was complex and interesting enough to form a thesis of its own. But since there was the synthesis matter left to deal with, the time was not sufficient to realise the true potential of the wavelet analysis. 9.1 Problem formulations Another way of confirming the thesis result is to return to the fundamental problem questions, defined in the very beginning, and see how well they are answered: What sort of information is possible to extract from an arbitrary piece of recorded music? The main information retrieved is what tones are played, and when they are played. While this offers no possibility to estimate the tempo, it gives the position in time and the length of all tones, which in a way can be seen as tempo. No beat detection is performed, and drum sounds are not accounted for in the analysis. Harmonics can be found, but they are not stored. The essence is to find the fundamental tones forming the melody of the music. How must the extracted information be represented to be storable? 
Since the focus is put on the melody, it is saved as a text string, simply noting the order in which the tones were played. Along with this, the length of each tone is stored, for synthesis purposes. All analysed input is stored in the same text file, forming a database of the available information for the synthesis. How can the stored information be used to synthesise new material? A statistical analysis of the sequences in the database file is performed, in order to make a single model of all patterns of tones and their possible successors. This way, all input characteristics are used to form the output, which is completely synthesised from the model. During the synthesis, the stored lengths are used to decide the lengths of the new tones. The thesis work has managed to answer the problem questions on which it was built, although the answers are in some cases simplified compared to what was expected. 60 Conclusions Björkvald, Svensson 2004 9.2 Limitations Wild ideas are in some ways the necessary fuel for the inspiration when carrying out an experiment like this thesis. But as always, the circumstances will sooner or later bridle the enthusiasm. There was simply not enough time to implement even nearly as advanced features as were desired from the start. Limitations had to be applied to the objectives, some of them more prominent than others in the final application. 9.2.1 No storing of simultaneous tones The Markov model poses a major drawback when working in the context of music; the lack of methods for analysing and synthesising concurrent events. Thinking of the complexity of music in general, it is quite obvious that simultaneous tones are more of a rule than an exception. To somewhat compensate for this, the application treats concurrent tones as if they are following each other. In reality, tones rarely start at the exact same time instant. Even when playing chords there is likely to be a small gap in time between the start of each tone. And since all tones in a (not too experimental) song are supposed to harmonise with each other, treating simultaneous tones as being played one at a time does not corrupt the content of a song. It may slightly change the melody though. Not being able to synthesise simultaneous events is a larger problem. To only create output where no tones are concurrent can never produce anything sounding more advanced than a simulation of a human playing a simple monophonic tune on for example a piano. 9.2.2 No instrument identification or separation Identifying and separating instruments proved to be a task far more challenging than expected. The only information found on this subject involved machine learning – training the program in recognising the sound of certain instruments. As one of the fundamental ideas behind the thesis was to provide the program with sound files and nothing else, this (in itself interesting) AI approach would be a step in the wrong direction. Besides, using a technique like machine learning would still mean that the program would just be able to identify instruments that it had been trained to recognise. Only generalised approaches were of any interest, as the application was not supposed to “know” anything about the input. All in all, the sum of the limited amount of time available and the lack of well-known methods for instrument identification led to these ideas being abandoned. 
9.2.3 MIDI for playback

Although MIDI can be a powerful tool for musicians, the standard synthesiser on a regular inexpensive soundcard can never reproduce MIDI sounds in such a way that they sound anything like real instruments. Playing the output from the Markov chain as MIDI does not impress the listener in any way. However, since no instrument identification has been performed, what is being played is unknown, and a cheap synthesiser is as good a replacement for an arbitrary instrument as any "perfect" simulation of organic sounds.

9.2.4 No beat detection

Looking at the tone lengths and the general structure of the input, the tempo can fairly easily be found. But as soon as more input is to be analysed, there are problems if the tempo is not the same. How should the tempos be combined into one value for the output? Should the new tempo simply be the mean of the inputs, or should it matter which of the inputs has contributed the most to the output? If there are changes in tempo somewhere in any input, should its tones be separated or should the mean value be used in the analysis? There were a lot of uncertainties surrounding these matters, and in the end a completely different approach was used: looking at the lengths of individual tones and using them to construct the tempo of the new synthesised music.

9.2.5 Combining different inputs

In order to synthesise output that is a combination of all input, the inputs have to be in the same range, since the Markov chain does not allow for leaps that were not present in the input. For example, the sequences [ c4d4e4c4 ] and [ d4c4e4d4 ] are not possible to combine using a chain of third or second order. They can be used together if the chain is of first order, but on the other hand this is rarely of any interest. Seeing that inputs can sometimes be difficult to combine even when the melodies are played in the same octave, it is easy to understand that there will be problems if some input is played in a completely different octave. This is not accounted for in the application, and using inputs that differ much in frequency will most likely result in an output comprising tones from only one of the inputs.

9.3 Thesis separation

The major problem with the thesis separation is the link between the two parts. Both of them are powerful, but unfortunately combining them means losing some of the possibilities. The most obvious example of this is the loss of information found in the wavelet analysis, i.e. it is not possible to make full use of the ability to actually find simultaneous tones and harmonics. Because of this, the analysis looks like the more successful of the two parts at first glance. Further development of it could form a very useful wav2midi converter, fully able to convert not only monophonic melodies, but also simultaneous tones and chords. In practice, this would mean using a quantisation of the coefficient results from the analysis, rather than the binary thresholding, which removes all information about the actual strengths of the found tones. The Markov chain implementation is also very useful on its own, especially since it offers the user a way of dynamically changing the order of the chain. The major problem, however, is that the structure of the Markov chain favours the use of strings and is generally text-based in its appearance.
Applying it to language and text synthesis rather than the music representation of this thesis would probably prove its true strength. Since the analysis part mainly expresses itself with correlation coefficients and numbers, there is an apparent communication issue causing problems. Perhaps using a simpler method could have performed a just as suitable analysis for this cause. With the Markov synthesis only using monophonic input material, the wav2midiprinciples could have been adequate. 9.4 Artificial intelligence aspects In order to enjoy the music created with the application, and avoid having a machine-like feel to the output, certain AI factors have proved themselves very important. The focus during the planning of the project was put on the pure analysis and structuring of sound information, but when the synthesis actually started to produce listenable results, the characteristics of the output needed to be considered. What makes a melody sound like it is played by a human being? Originally, the idea was that the analysis would provide the synthesis with all necessary information for it to create an output that sounds “real”, without having to modify any parameters of the synthesis. However, since the analysis results had to be somewhat limited, some parameters had to be introduced in the synthesis to make up for the lost information. By altering these, the output can take on different forms, giving the user a possibility of creating something very machine-like or human, whichever is desired. This aspect took the thesis into yet another major research field; simulation of human behaviour. 62 Conclusions Björkvald, Svensson 2004 9.5 Music theory aspects The plan originally was to minimise the amount of music theory used in the thesis, and approach the problem from a very mathematical point of view. These plans have not been abandoned in the final application, although some bits of music theory were necessary to implement: Tone frequencies During the wavelet analysis, the idea is to analyse signals with respect to their frequency content. To make this relevant for music, all tones searched for during the analysis are translated into pure frequencies. This requires the analysis to use the normal western tone notation of twelve semitones in each octave, and may restrict the application from being used in more experimental music surroundings. To start and end the synthesis on the same tone This may seem like a minor detail, but implementing the fact actually makes the output sound more like a real melody. This is because it is very uncommon to end melodies on a different note than the one they started with. Although no information is stored with regards to chords or harmonics found in the analysis, the application very rarely assembles notes that fit badly together. Storing the found tones in time order solves this. Chords are not stored as a unit, but as individual tones. However, since the individual tone’s successors become the other tones of the chord, they are virtually guaranteed to sound well together. This is reflected in the synthesis, since no other successor tones than the ones stored from the analysis are selected. One major drawback from not using any tempo or beat detection in the analysis is that the synthesis often produces quite unrhytmical melodies. This is something that a more thorough analysis based on theoretical tempo knowledge possibly could have solved. 
9.6 Final comments Generally, the idea of music synthesis based purely on a thorough analysis of input material, and especially the idea of being able to affect the compositions only by selecting the input and setting certain parameters, is a mouth-watering prospect. It offers anyone the chance of creating music, without having any criteria for talent or musical knowledge. While the final result of this thesis might not offer this sort of complex composition possibility, it clearly presents the high potential of the principle. Expanding the application to make it able to make use of polyphonic tones and melodies would mean a lot of work, but this thesis and the future work proposed by it could be a good starting point. The work has involved more or less deep insights into a number of different subjects; music theory, DSP, AI, mathematics and so on. While this fact poses limitations, mostly due to insufficient time resources (i.e. each field could have been explored more), it also leads to a vast number of possible future implementations. The realisation of these features would most definitely take the thesis one step closer to the ultimate aim: an artificial “hit-maker”. 63 Future work Björkvald, Svensson 2004 10 Future work This chapter states possible extensions and ideas for future work gained during the course of the thesis work. 10.1 Improving performance • The most apparent solution for improving performance of the application would be to translate the code into a precompiled programming language, tentatively C or C++. The reason for not doing this directly during the process of the thesis is that the Matlab language contains predefined methods for a number of the most important steps of the application; sampling, Fourier transforms, wavelets, matrix handling and so on. Using another language would require all of these functions being written, taking up a lot of valuable time. • Another important improvement could be to separate input files into smaller parts, and then perform the wavelet analysis at smaller intervals. Some sort of interpolation could be performed, filling out the analysis results in between the different parts. This would in principle remove the upper limit (hardware dependent) for the size of the files being analysed, and could also possibly improve the speed of the analysis. 10.2 Improving the features of the analysis • The perhaps most important improvement of the analysis would be to implement a proper instrument identification. By being able to separate the analysis, each instrument could be analysed and synthesised individually. If the MIDI format is kept, this would mean that the synthesis could assign the correct instruments to the new melodies, creating a full soundscape rather than a single monophonic melody. • Proper instrument identification would also allow drums to be properly analysed, putting more effort into tempo and beat tracking. This may result in the beat and drum synthesis forming a separate part of the application. The human voice is also an interesting feature; what if it was possible to analyse the backing music and the sung melody individually? This way, the music and the vocal melody could also be synthesised in order to match each other, along with the drum patterns. • Since the analysis is able to find simultaneous tones, a natural progression would be the ability to store the chords, not only singular tones at a time. 
This would move the application even further away from its monophonic nature, and make it employ more of the chord and scale theories.

• A quantisation of the CWT coefficients is already mentioned in sections 7.2.7 and 9.3. By determining the strengths of all tones, fundamentals as well as harmonics, the analysis could in a simple way be developed into a powerful wav2midi application, since the harmonic features of the tones could be utilised. It would also offer the possibility of synthesising MIDI based on the strength of the tones in the input, rather than just randomising the velocity variable, thereby obtaining a more successful simulation of human behaviour.

• Another way of improving the analysis could be to extend it so that it also incorporates looking at the signal in the time domain. Sometimes information about a tone's duration and location in time is easier to derive from this domain than from the time-frequency domain representation of the wavelet transform. A combination could be a way of optimising the analysis results.

• The final application can only deal with one-channel sound files, i.e. mono audio. Of course, stereo handling is a concern for future work. Should the analysis be performed for each separate channel, or should the channels be merged? Analysing them separately could extend the synthesis, but exactly how should the channel information be used?

• A major issue is the choice of transform. The CWT implementation of this thesis is a sort of mixture of the CWT and the DWT, i.e. it performs a complete analysis, but only at selected scales, and it downsamples the signal according to the frequency content of interest. However, the pure DWT may be an alternative, if an equally efficient way of reconstructing signals from the (fewer) coefficients can be found.

10.3 Extending the statistical model

• In order to reduce the communication problem between the analysis and synthesis parts, the possibility of storing complete chords would be useful. To do this, the Markov model needs to be able to handle several tones at a time. The relationships and statistics of not only single tones but also full chords would then serve as a source for the synthesis. Again, this would also allow more "proper" music theory to be used, i.e. different scales and their corresponding chord structures.

• The statistical model could also be improved by transposing the input, so that a connection between all sequences is guaranteed. The possibility of merging music played in completely different octaves and/or keys has to be weighed against the loss of important characteristics.

• Another way of looking at the storing of analysed information would be to store the actual CWT coefficients directly, thereby avoiding the "stringification" of all analysis results. This, along with a statistical model better suited to this type of representation than the Markov chain, which is quite dependent on text-based data, may change the synthesis part completely.

• To generate an output having the structure of a traditional song, "intelligence" has to be added to the statistical model. Recurring events like verse and chorus have to be found in the statistical analysis and used in the synthesis. This could involve both a Markov chain and some other structural analysis method specialised in pattern recognition.

10.4 Enhancing the realism of the synthesis

• For simplicity, MIDI was chosen as the format for all synthesis.
It is an easy way of representing the synthesised tone sequences, and making them listenable. However, the sound of a MIDI representation can in no way be compared to wave-format audio. Therefore, a more realistic sounding synthesis representation is desirable. One proposal for this, although probably quite advanced, is to perform the synthesis by using an inverse transform directly on the CWT coefficients. Of course, this requires that the synthesis produces an output written as coefficients. If done properly, the coefficients would include all information about harmonics and thereby the most important characteristics of the instrument being played. The MIDI format could then be abandoned for “real” sound waves. The Morlet pseudowavelet used does not meet the admissibility condition, and is therefore not fully invertible. For this type of synthesis to become a reality, there might be a need for another type of mother wavelet. 65 Closing thoughts Björkvald, Svensson 2004 11 Closing thoughts Ever since the first synthesisers arrived, people have always argued about who is really composing the music when a machine is involved in one way or another. If software is used in the production of the music, is it then the user, the programmer or the software itself that is the composer? Issues like these become more and more relevant as computers become more and more a natural part of music creation. For this thesis, questions concerning who or what is making the music are of even larger importance, since a computer is used to create new music, completely based on the compositions of (usually) other people than the user. Where should the line between theft and borrowing and/or gathering inspiration be drawn? This is a common problem in the music industry where people are basically sued on a daily basis for “stealing” parts of other people’s songs. But what if a machine is doing the stealing? Is it all right as long as it cannot easily be heard from where a particular part has been taken? This could mean that it might be okay to use a Markov chain of low order, but not one of higher order. As the order of the chain increases, the output will be more alike the inputs, and when the order becomes high enough, it will in principle be an exact copy of one of the inputs. At this point, the application has more or less created a cover version of another song and it is time to start paying royalties to the composer. Li discusses music creation with machines, and focuses on the almost philosophical question “can a machine ever be considered the composer of a musical piece?” There is no doubt that there are music-making machines out there, the application described in this thesis being one of them. But a machine producing music is not a composer per definition. To be able to answer the question above, there has to be a distinct definition of what music composition really is. According to Li, composing music must involve intelligence to some extent; the composer must be able to make its own choices based on some general knowledge. If the composition system uses ad hoc knowledge directly from the creator it cannot be seen as a composer. In this case, the machine is merely an extension of the builder, not an intelligent entity of its own. [5] But what about a machine that does not have any knowledge whatsoever and is unable to learn? In the application accompanying this thesis, all decisions are based on the statistics of previous decisions made by various composers. 
Creating patterns with Markov chains is in a way just a matter of imitating the structure of someone else’s work. This could hardly be seen as composing, could it? Still, given a large enough database of inputs and using a Markov chain of relatively low order, the output could, at least in theory be something completely unique. If a machine creates musical pieces that no one has heard before, is it not a composer then, intelligent or not? Li claims that that a machine can only be said to make music if the user is unrelated to the builder and if the machine is autonomic [5]. This is true for the Markov-based music application, if used by an arbitrary person. Admittedly, the software is not completely autonomic in the sense that the user provides the input and can decide the order of the chain. However, the process of combining tone sequences is made out of control from the user, and seeing it that way, the application is autonomic. So, the program is fully capable of making music, but can it compose? Perhaps the answer is no. Perhaps it is a mockery to all hard working talented composers out there to even suggest that a machine in itself could be a composer. But then on the other hand, is it the music or who composed it that is important? As soon as an artistic work of any kind is made available to a wider audience, it does not belong to the creator anymore. Sure, he/she can still receive royalties for it, but since anyone is free to interpret the work, it is out of control for 66 Closing thoughts Björkvald, Svensson 2004 the creator from there on. The harsh truth is that vision he/she had for the work has become unimportant, no matter how interesting it was. There was a time when making music with machines was considered of less artistic value than using “real” organic instruments. Times have surely changed since, and nowadays only musical elitists really care about how the music was made. The possibility of seeing the name of a computer or software as composer for a musical piece in the near future is not science fiction at all. Whether the application created in conjunction with this thesis can compose or not can be left for the reader to decide. In any way, it does not create anything more advanced than monophonic melodies. For this to ever be released on a record, a human being needs to include it in a larger arrangement, possibly containing percussion and vocals. Furthermore, the program does not really capture the structure of choruses and verses. At best, it can create a melody line, which can be used in a song put together by a real person. This could mean that the application and a human in combination serve as composer. While this is not the most advanced software available, there is no other well-known program that poses any concrete threat to human composers. The day a machine can produce complete songs, ready for release to the record-buying audience, it might be time to start thinking about another line of work. But until then (and probably far beyond), there will always be a need for the unique machinery of the human ear and mind in the creation of musical pieces. After all, to the best of all knowledge, no man-made machine will ever fully understand the complex emotions behind all great works of art. 67 Bibliography Björkvald, Svensson 2004 12 Bibliography 12.1 Literature [1] Alm, J. F., Walker, J. S. (2002). Time-Frequency Analysis of Musical Instruments, SIAM Review Vol. 44, No. 3. [2] Bohn, D. (1997). Signal Processing Fundamentals, Rane Corporation. [3] Jehan, T. 
(1997). Musical Signal Parameter Estimation, Berkeley University. [4] Kamen, E., Heck, S. (1997). Fundamentals of signals and system using MATLAB, Prentice-Hall. [5] Li, T-C. Who or What is Making the Music: Music Creation in a Machine Age, Faculty of Music, McGill University. [6] Mosley, B. (2002). Audio to MIDI conversion, University of Derby. [7] Nowak, R. (2003). Fast Convolution Using the FFT, The Connexions Project. [8] Petterson, R. (2001). Föreläsningsanteckningar om Markovkedjor, Växjö Universitet. [9] Rauterberg, M. Why and what can we learn from human errors?, Advances in Applied Ergonomics, West Lafayette, USA Publishing. [10] Sadowsky, J. (1996). Investigation of Signal Characteristics Using the Continuous Wavelet Transform, Johns Hopkins APL Technical Digest, Volume 17, Number 3. [11] Self G. (2001). Wavelets for Sound Analysis and Re-Synthesis, University of Sheffield. 12.2 Web Last visited 2004-04-28 unless otherwise stated. [12] Chapman, D. Multiana. Last visited 2003-10-10, no longer available. http://www.met.rdg.ac.uk/~chapman/spectrum/ [13] Hansper, G. An introduction to MIDI. http://crystal.apana.org.au/ghansper/midi_introduction/contents.html [14] Jacques, L., et al. Yet Another Wavelet ToolBox (YAWTB), Institut de Physique Théorique. http://www.fyma.ucl.ac.be/projects/yawtb/ [15] Keren, Y. Data Structures & Algorithms Toolbox. http://www.mathworks.com/matlabcentral/fileexchange/loadFile.do?objectId=212&objectTyp e=file [16] Kesteloot, L. Markov chains. http://www.teamten.com/lawrence/projects/markov/ 68 Bibliography Björkvald, Svensson 2004 [17] Kieft, B. A brief history on wavelets. http://www.gvsu.edu/math/wavelets/student_work/Kieft/Wavelets%20%20Main%20Page.html [18] Koniaris, K. Understanding Notes and their Notation. http://koniaris.com/music/notes/ [19] Lipscomb, E. Introduction into MIDI. http://www.harmony-central.com/MIDI/Doc/intro.html [20] The MathWorks Inc. Complex Morlet Wavelets: cmor, Matlab Wavelet Toolbox documentation. http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/ch06_a37.shtml#40178 [21] The MathWorks Inc. cwt, Matlab Wavelet Toolbox documentation. http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/cwt.shtml [22] The MathWorks Inc. scal2freq, Matlab Wavelet Toolbox documentation. http://www.mathworks.com/access/helpdesk/help/toolbox/wavelet/scal2frq.shtml [23] Mathworld. Convolution, Wolfram Research Inc. http://mathworld.wolfram.com/Convolution.html [24] Maurer IV, John A. A Brief History of Algorithmic Composition. http://ccrma-www.stanford.edu/~blackrse/algorithm.html [25] Multimedia Education Group. Text Synthesis. http://www.meg.uct.ac.za/downloads/VBA/textgen.htm [26] The International MIDI Association. Standard MIDI-File Format Spec. 1.1. http://www.pgts.com.au/download/txt/midi.txt [27] Miranda, E.R An introduction to music and Artificial Intelligence. http://website.lineone.net/~edandalex/ai-essay.htm [28] Mugglin, S. Music Theory for Songwriters. http://members.aol.com/chordmaps/ [29] O'Connor, J.J, Robertson, E.F. Ingrid Daubechies. http://www-gap.dcs.st-and.ac.uk/~history/Mathematicians/Daubechies.html [30] Recognisoft. Wav to MIDI conversion software - Solo Explorer. http://www.recognisoft.com [31] Russell, S. Introduction to AI – a modern approach. http://www.cs.berkeley.edu/~russell/intro.html [32] Valens, C. A really friendly guide to Wavelets. http://perso.wanadoo.fr/polyvalens/clemens/wavelets/wavelets.html 69