Journal of Signal Processing, Vol. 17, No. 2, pp. 29-38, March 2013

PAPER

Sparseness Criteria of F0-Frequencies Selection for Specmurt-Based Multi-Pitch Analysis without Modeling Harmonic Structure

Daiki Nishimura, Toru Nakashika, Tetsuya Takiguchi and Yasuo Ariki
Graduate School of System Informatics, Kobe University, Kobe 657-8501, Japan
E-mail: {nishimura, nakashika}@me.cs.scitec.kobe-u.ac.jp, {takigu, ariki}@kobe-u.ac.jp

Abstract

This paper introduces a multi-pitch analysis method using specmurt analysis without modeling the common harmonic structure pattern. Specmurt analysis is based on the idea that the fundamental frequency distribution is expressed as a deconvolution of the observed spectrum by the common harmonic structure pattern. To analyze the fundamental frequency distribution, the common harmonic structure needs to be modeled accurately, because it is often unknown while the observed spectrum is known. It is considered impossible, however, to obtain a highly accurate model of the structure, since it can vary slightly depending on the pitch. Therefore, we propose a method to analyze the fundamental frequency distribution without modeling the harmonic structure. We note that each peak of the observed spectrum indicates either a fundamental frequency or a harmonic tone. Hence, the fundamental frequency distribution can be regarded as the set that contains only the peaks corresponding to fundamental frequencies. To find this set, we prepare many sets of the peaks of the spectrum and obtain a large number of common harmonic structures. We evaluate the sparseness of these structures using the L1 or L2 norm, and then select the set that yields the sparsest structure as the solution. The experimental results show the effectiveness of the proposed method.

Keywords: multi-pitch analysis, sparseness criteria, specmurt analysis

1. Introduction

In recent years, music information processing technology has improved dramatically.
This gives us many opportunities to create music. For example, in the past only those with specific musical skills could compose or arrange music, but now anyone can enjoy these activities using various music-related software. However, some fields still rely on people with specific skills, such as perfect pitch. This ability is necessary when attempting to reproduce or score music simply by hearing it, and considerable experience and effort are needed to acquire it. In particular, it is difficult to analyze a signal that contains tones of different pitches at the same time. Therefore, a technology for analyzing multi-pitch signals is required. Monophonic music can be analyzed with relatively high accuracy [1]-[4]. However, multi-pitch music is more difficult to analyze than a single tone. An acoustic signal carries information on fundamental frequencies and harmonic frequencies, but in the case of multi-pitch sounds, it is unknown which peak corresponds to a fundamental frequency and which to a harmonic frequency. Moreover, the number of fundamental frequencies is not always known. This is one reason for the difficulty of multi-pitch analysis. Many techniques have been tried for multi-pitch analysis in the past, such as a comb filter [5], statistical information on chords and their progression [6, 7], iterative estimation and separation [8], linear models for the overtone series [9], parameter estimation of superimposed spectrum models [10, 11], and acoustic object modeling using a GMM with estimation by an EM algorithm [12]-[14]. Specmurt analysis [15]-[21] is another method of multi-pitch analysis.
The method defines the observed spectrum as a convolution of the fundamental frequency distribution and instrumental information, and it differs from the methods listed above in its introduction of the specmurt domain: [5] is processed in the time domain, while [6]-[14] are processed in the spectrum domain.

Fig. 1 Positional relationship between fundamental and harmonic frequencies: (a) linear-frequency domain, (b) log-frequency domain

Fig. 2 Generation of a multi-pitch spectrum by convolution of a common harmonic structure and a fundamental frequency distribution

The conventional specmurt analysis is based on an approach that first obtains instrumental information [18] by iteratively generating a model called the "common harmonic structure" and then gives a fundamental frequency distribution based on the model. This method builds a common harmonic structure, and the approach rests on the premise that the relative powers of the harmonic components are common and do not depend on the fundamental frequency. However, it is considered impossible to obtain a highly accurate model of the structure, since it can vary slightly depending on the pitch. Because of this dependency of the harmonic structure on the pitch, a data-driven approach that selects the harmonic structure without assuming a common one is needed. Therefore, we propose a new method based on sparseness criteria to analyze the fundamental frequency distribution.

2. Specmurt Analysis

2.1 Multi-pitch spectrum in log frequency

In our study, acoustic signals having harmonics are analyzed; percussive signals such as drums are not targeted. The n-th harmonic frequency is n times the fundamental frequency in the linear-frequency scale.
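The pitch invariance of the harmonic pattern on the log-frequency axis can be checked numerically. A minimal sketch (the two fundamentals and the harmonic count are arbitrary values chosen only for illustration):

```python
import numpy as np

# Two hypothetical fundamentals (values chosen only for illustration).
f0_a, f0_b = 220.0, 330.0
n = np.arange(1, 6)          # harmonic numbers 1..5

# Linear scale: the spacing n*f0 between harmonics depends on the fundamental.
lin_a, lin_b = f0_a * n, f0_b * n

# Log scale: the n-th harmonic always sits log(n) above the fundamental,
# so the whole pattern is a rigid shift of one fixed template.
off_a = np.log(lin_a) - np.log(f0_a)
off_b = np.log(lin_b) - np.log(f0_b)

print(np.allclose(off_a, np.log(n)))   # offsets are exactly log(n)
print(np.allclose(off_a, off_b))       # ...independent of the pitch
```

Both checks hold for any pair of fundamentals, which is exactly the property that lets specmurt analysis use a single shifted template h(x).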
Therefore, when the fundamental frequency shifts by ∆ω, the n-th harmonic frequency shifts by n × ∆ω (Fig. 1(a)). Meanwhile, in the log-frequency scale, the n-th harmonic frequency is located log n away from the fundamental frequency. This means that all harmonic frequencies shift by ∆x when the fundamental frequency shifts by ∆x in the log-frequency scale (Fig. 1(b)). In specmurt analysis, it is assumed that the relative powers of the harmonic components are common and do not depend on the fundamental frequency. This pattern is called the common harmonic structure h(x), where x represents log-frequency. Its fundamental frequency is located at the origin, and its power is normalized to 1. Any pitch spectrum can then be expressed as a shift of h(x) along the x-axis once a fundamental frequency on the log axis is given. A multi-pitch spectrum can thus be generated by adding copies of the common harmonic structure h(x), each shifted to a fundamental frequency and multiplied by the corresponding power. If the distribution of the powers of the fundamental frequencies is defined as a fundamental frequency distribution u(x), a multi-pitch spectrum v(x) is the convolution of h(x) and u(x), as shown in Fig. 2:

v(x) = h(x) ∗ u(x)    (1)

2.2 Analysis of fundamental frequency distribution

If the common harmonic structure h(x) is known, the fundamental frequency distribution u(x) can be estimated by deconvolving an observed multi-pitch spectrum v(x) by h(x):

u(x) = h(x)⁻¹ ∗ v(x)    (2)

According to the convolution theorem, Eq. (2) can be expressed as

U(y) = V(y) / H(y)    (3)

where U(y), H(y) and V(y) are the inverse Fourier transforms of u(x), h(x) and v(x), respectively.
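The generative model of Eq. (1) can be illustrated with a small numerical sketch. The bin width, the number of harmonics, the harmonic powers (here the simple 1/n shape discussed in Sec. 2.3), and the note positions are all assumptions for illustration, not values from the paper:

```python
import numpy as np

X = 128                           # number of log-frequency bins (assumed)
dx = np.log(2) / 36               # bin width: 36 bins per octave (assumed)

# Common harmonic structure h(x): impulses at log(n) on the discrete
# log-frequency grid, with illustrative 1/n powers (three harmonics).
h = np.zeros(X)
for n in (1, 2, 3):
    h[int(round(np.log(n) / dx))] = 1.0 / n

# Fundamental frequency distribution u(x): a two-note chord as impulses.
u = np.zeros(X)
u[10], u[20] = 1.0, 0.8

# Eq. (1): the observed multi-pitch spectrum is the convolution h(x) * u(x).
v = np.convolve(u, h)[:X]
```

Each note contributes a shifted, scaled copy of h(x) to v(x); overlapping harmonics of different notes simply add, which is what makes the inverse problem nontrivial.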
We can obtain u(x) from the Fourier transform of U(y) in the y domain as follows:

u(x) = F[U(y)]    (4)

As described above, this method of estimating the fundamental frequency distribution by deconvolution in the log-frequency domain is called specmurt analysis [15]-[21], and the y domain (defined via the inverse Fourier transform of the log-frequency spectrum) is called the specmurt domain. In practical calculation, the transform to the y domain may be computed with the Fourier transform. In specmurt analysis, a wavelet transform, which can perform an analysis on the log-frequency axis, is used to extract spectra instead of the short-term Fourier transform, since the observed spectrum v(x) is handled in the log-frequency domain. One characteristic of specmurt analysis is that it can analyze music signals in which pitch changes occur within a short time. The analysis result can therefore be obtained as visual information in the form of a piano roll, where the horizontal axis represents the time index and the vertical axis represents the pitch.

Fig. 3 Example of observed spectrum (piano triad)

2.3 Conventional approach with specmurt

The fundamental frequency distribution u(x) can be obtained using Eq. (2) if the observed spectrum v(x) is given and the common harmonic structure h(x) is known. However, h(x) is generally unknown. For this reason, h(x) has been modeled in several ways. In [15, 16], a common harmonic structure in which the power of the n-th harmonic component is 1/n of that of the fundamental component is defined. This is based on the prior knowledge that a natural sound spectrum commonly has such a shape.
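The deconvolution of Eqs. (2)-(4) can be sketched as division in the transform domain, here with a toy structure in the spirit of the 1/n model above. The small eps regularizer and the use of numpy's forward FFT for the transform pair are our assumptions; the paper fixes only the transform convention, which affects scaling, not the idea:

```python
import numpy as np

def specmurt_deconvolve(v, h, eps=1e-6):
    """Estimate u(x) from v(x) and h(x) by dividing in the transform
    domain of the log-frequency axis (Eqs. (2)-(4)). eps is a small
    regularizer (our addition) guarding against near-zero bins of H(y)."""
    V = np.fft.fft(v)                              # V(y)
    H = np.fft.fft(h)                              # H(y)
    U = V * np.conj(H) / (np.abs(H) ** 2 + eps)    # U(y) = V(y) / H(y)
    return np.real(np.fft.ifft(U))                 # back to log-frequency

# Round trip: convolve a known u with h, then recover it.
X = 128
h = np.zeros(X); h[0], h[36], h[57] = 1.0, 0.5, 1.0 / 3   # toy structure
u = np.zeros(X); u[10], u[20] = 1.0, 0.8                  # two-note chord
v = np.real(np.fft.ifft(np.fft.fft(u) * np.fft.fft(h)))   # circular Eq. (1)
u_est = specmurt_deconvolve(v, h)
```

With this toy h(x) the division is well conditioned, so u_est recovers the two impulses almost exactly; with real spectra, H(y) can have small bins, which is why some regularization (or the iterative estimation of [17, 18, 21]) is needed in practice.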
However, the optimal fundamental frequency distribution u(x) is not always obtained by such an approach, since the common harmonic structure varies depending on the tone. In [17, 18, 21], a quasi-optimization with an iterative algorithm is used for estimating h(x), but a more accurate modeling method is required.

3. Sparseness-Based F0 Selection

3.1 Problem with modeling of the common harmonic structure

As mentioned in the previous chapter, the conventional multi-pitch analysis with specmurt focuses on how to model the common harmonic structure h(x). However, it is considered difficult to obtain a strictly correct model of the structure, since it is known that the harmonic structure varies slightly depending on the pitch. Therefore, we propose a method to analyze the fundamental frequency distribution without modeling the harmonic structure.

3.2 Outline of proposed method

If there is no noise in the observed spectrum v(x), it is believed that each peak corresponds to a fundamental frequency or a harmonic frequency. Fig. 3 shows an example of the spectrum of a piano triad. It is considered possible to obtain the fundamental frequency distribution u(x) by correctly selecting the set of peaks in the observed spectrum. Hence, our method focuses on finding the fundamental frequency distribution ũ(x) that has all peaks corresponding to the fundamental frequencies of multiple tones and does not have any peaks corresponding to the harmonic frequency components. Fig. 4 shows the flowchart of our method.

Fig. 4 Flowchart of sparseness criteria of F0-frequencies selection for specmurt-based multi-pitch analysis

Fig. 5 Example of generation of ûi(x) from the observed spectrum

Fig. 6 Examples of ûi(x) generated from Fig. 3 (upper row) and ĥi(x) corresponding to each ûi(x) (lower row)

Fig. 7 Example of harmonic structure (single tone A3 of piano)

First, some candidates for the fundamental frequency distribution are generated from the observed spectrum. Second, using specmurt, the harmonic structures corresponding to the candidates are calculated. Among the obtained structures, non-harmonic structures are rejected, and the optimal harmonic structure is found based on the sparseness of the remaining structures. Finally, the candidate corresponding to the optimal harmonic structure is selected as the correct fundamental frequency distribution.

3.3 Generation of candidates for fundamental frequency distribution

It is difficult to extract only the peaks of the fundamental frequencies from the observed spectrum, since it cannot be determined which peak corresponds to a fundamental frequency and which to a harmonic component. Therefore, we first discuss the candidates for ũ(x). It is known that the peaks corresponding to the fundamental frequencies have a certain level of power. We consider the M major peaks of the observed spectrum and obtain sets û(x) by selecting combinations from these M peaks. If the observed signal consists of L tones, the number of candidates is MCL (the number of L-combinations of the M peaks), because the number of peaks of ũ(x) should equal the number of tones. However, the number of tones is often unknown. Thus, the number of candidates λ is expressed as λ = Σ_{l=1}^{L} MCl, so that the method covers everything from a single tone up to L tones. Fig. 5 shows an example of the generation of ûi(x) from the observed spectrum. The candidates ûi(x) (i = 1, 2, ..., λ) of ũ(x) each consist of a combination of peaks obtained from the observed spectrum. Each peak is processed as an impulse whose power equals that of the corresponding peak.

3.4 Selection of optimal harmonic structure

3.4.1 Calculation of harmonic structure using specmurt

A harmonic structure is obtained according to Eq.
(1) as follows:

h(x) = u(x)⁻¹ ∗ v(x)    (5)

One solution for the harmonic structure, ĥi(x), is obtained by substituting the candidate ûi(x) for u(x) in Eq. (5). In this section, we discuss how to select the optimal harmonic structure h̃(x) among the ĥi(x) (i = 1, 2, ..., λ). The figures in the upper row of Fig. 6 illustrate examples of ûi(x) generated from a piano triad (Fig. 3), and those in the lower row show the ĥi(x) corresponding to each ûi(x). Fig. 6(a-1) shows a candidate ûi(x) that has all the fundamental frequencies and no harmonic frequencies, i.e., the correct combination of spectral peaks ũ(x). Fig. 6(b-1) and Fig. 6(c-1) are examples of incorrect combinations of spectral peaks, which lack some fundamental frequencies or include some harmonic frequencies. Fig. 6(a-2) shows the ĥi(x) corresponding to the correct combination of spectral peaks ũ(x), i.e., the optimal harmonic structure h̃(x). As these figures show, h̃(x) is the most similar to the harmonic structure (Fig. 7) among those in the lower row of Fig. 6. On the other hand, the harmonic structure in Fig. 6(b-2) has numerous peaks, although that in Fig. 7 has no peaks at the same positions. Also, Fig. 6(c-2) does not have large peaks at the harmonic frequencies. A structure like Fig. 6(c-2) is called a non-harmonic structure in this paper.

3.4.2 Rejection of non-harmonic structures

In order to reduce the computational cost of finding the optimal harmonic structure, which is described in Section 3.4.3, non-harmonic structures are rejected in advance using the technique described in this section. If the instrument or the pitch varies, the relative power ratio of each harmonic frequency varies, but the positions at which the harmonic frequencies appear do not. Therefore, the positions of the harmonic frequencies in the harmonic structure (Ω2, Ω3, ..., ΩN) are regarded as information that is independent of the pitch, where Ωn represents the position of the n-th harmonic component and Ω1 represents the origin position of the fundamental frequency. Based on this information, it is important to check whether there are values at (Ω2, Ω3, ..., ΩN). For example, any structure that does not have any large peaks at (Ω2, Ω3, ..., ΩN), like Fig. 6(c-2), is treated as a non-harmonic structure, and such structures are rejected using an experimentally set threshold.

3.4.3 Finding the optimal harmonic structure based on the sparseness

An ideal harmonic structure has peaks only at the fundamental frequency and the harmonic frequencies, as in Fig. 7. In our method, in order to select the optimal harmonic structure, we calculate the sparseness of each ĥi(x) that is not rejected as described in Section 3.4.2. According to Fig. 7 and Fig. 6, the optimal harmonic structure h̃(x) is considered to be sparser and to have larger peaks at the harmonics than the other ĥi(x). Thus, the sparseness S is defined as

S(i) = −{αLa(i) − (1 − α)Lb(i)}    (6)

where α represents the weight. If the L1 norm is used in the first and second terms,

La(i) = Σ_{x=1}^{X} {1 − Σ_{j=1}^{N} δ(Ωj − x)} |hi(x)|    (7)

Lb(i) = Σ_{x=1}^{X} Σ_{j=1}^{N} δ(Ωj − x) |hi(x)|    (8)

where δ is the Kronecker delta. The first term, La(i), measures the sparseness (excluding the harmonic components), and the second term, Lb(i), is the summation of the values at the harmonics. If the L2 norm is used in Eq. (6),

La(i) = Σ_{x=1}^{X} {1 − Σ_{j=1}^{N} δ(Ωj − x)} hi(x)²    (9)

Lb(i) = Σ_{x=1}^{X} Σ_{j=1}^{N} δ(Ωj − x) hi(x)²    (10)

Assuming that h̃(x) = ĥĩ(x), ĩ can be determined by

ĩ = argmax_i S(i)    (11)

3.5 Correct fundamental frequency distribution

As described above, the optimal harmonic structure h̃(x) is obtained based on the sparseness criteria. Finally, the ũ(x) corresponding to h̃(x) is selected uniquely among the ûi(x). Summing up our method, the steps shown below are processed for each frame.

1.
Based on the observed spectrum v(x), the candidates ûi(x) of the optimal fundamental frequency distribution are prepared.

2. The candidates ĥi(x) of the optimal harmonic structure are obtained by substituting ûi(x) into Eq. (5).

3. Non-harmonic structures are rejected among the ĥi(x), and the sparsest remaining ĥi(x) is determined as h̃(x).

4. The ũ(x) corresponding to h̃(x) is selected among the ûi(x).

This method does not need to learn pitch or instrument information, since each step is independent of both.

Fig. 8 (a) Piano-roll of test MIDI (data A) and (b) piano-roll of test MIDI (data B)

Fig. 9 An example of analysis result (data A): red circles indicate some mistaken notes

4. Experiments

4.1 Conditions

To evaluate our method, we use two songs from the RWC Music Database (http://staff.aist.go.jp/m.goto/RWC-MDB/) as the test data (Table 1), and Fig. 8 shows the piano-rolls of data A and data B. The test signals are recorded at a 16 kHz sampling rate using MIDI instruments: piano, violin or acoustic guitar. A wavelet transform with the Gabor function is applied to the test data to obtain the spectrum. The parameter M described in Section 3.3 is set at 7. This means that we can analyze an observed signal having up to 7 simultaneous tones. Next, the parameter N is set at 6, since the value at ΩN tends to be unobservable as N increases, and setting too large an N might cause the rejection of all ĥi(x).

Table 1 List of experimental data
Symbol   Title           Catalog number
data A   Sicilienne      RWC-MDB-C-2001 No.43
data B   Gavotte E-Dur   RWC-MDB-C-2001 No.36

4.2 Results

Fig. 9 depicts the analysis results of data A, where (L2, L1) and a weight parameter of 0.9 are used for the piano roll. Almost all the notes are estimated correctly, but some notes are mistaken as octave-different notes. Fig.
10 and 11 show the accuracies of data A and data B for piano, violin, and guitar using the proposed method (without modeling the harmonic structure). For example, (L1, L2) in the figures indicates that the L1 norm is used in the first term of Eq. (6) and the L2 norm in the second term. The weight parameter α in Eq. (6) is changed from 0.0 to 1.0. The accuracy is calculated as follows:

Accuracy(%) = {Nall − (Nins + Ndel)} / Nall × 100    (12)

where Nall, Nins and Ndel represent the total numbers of notes, insertion errors and deletion errors, respectively. In our experiments, the note duration is not evaluated, and we permit the onset time to shift by τ seconds (τ = 0.3 in the experiments), since the onset time and the duration of each tone are not exactly equal to the score.

Fig. 10 Accuracy results (data A): (a) piano, (b) violin, (c) guitar

Fig. 11 Accuracy results (data B): (a) piano, (b) violin, (c) guitar

As shown in Fig. 10 and Fig. 11, the optimal parameter varied depending on the instrument. For piano, large weights gave higher accuracy (Figs. 10(a) and 11(a)). This suggests that La may work effectively for instruments with frequency structures similar to that of a piano, where the largest peak is observed at the origin (the fundamental frequency) and the peak values decrease toward higher frequencies. For violin, small weights resulted in higher accuracy (Figs. 10(b) and 11(b)). This suggests that Lb, which sums the values at the harmonics, may work well for instruments with frequency structures similar to that of a violin, which differs from the piano in having a larger peak at the second harmonic than at the fundamental frequency. We will need to investigate the effectiveness of La and Lb further in future work. For guitar, Fig. 10(c) shows that middle weights gave higher accuracy, while Fig. 11(c) shows that small weights did. In the case of the guitar, the largest peak is observed at the fundamental frequency, similar to the piano, but a guitar sometimes produces attack-sound peaks at frequencies lower than the fundamental. Therefore, the attack sound is occasionally regarded as the fundamental frequency, and the correct fundamental frequency is regarded as the second harmonic. For that reason, some notes may behave as in the violin case. Consequently, the optimal weight for a guitar varies depending on the number of notes that behave this way.

In all results for data A (except violin), the combination (L2, L1) gave the best accuracy, where the L2 norm is used in the first term, La, of Eq. (6). In order to increase the value of Eq. (6), La, which is the summation of the noise in the harmonic structure, has to be small, and Lb, which is the summation at the harmonics, has to be large. The L2 norm reduces the value of La (the first term in Eq. (6)) more than the L1 norm does, because the L2 norm makes values that are less than 1 even smaller (all noise values are smaller than 1). On the other hand, in order to increase the value of Lb (the second term in Eq. (6)), the L1 norm is better than the L2 norm, because most harmonic values are also smaller than 1.

Table 2 shows a comparison between the specmurt-based method with modeling of the common harmonic structure [18] and the proposed method on data A, where the optimal parameters are selected for each method. The proposed method (without modeling the common harmonic structure) obtained higher accuracies than the method with the common harmonic structure for each instrument.

Table 2 Comparison of a specmurt-based method with modeling of the harmonic structure to the proposed method
                          Piano   Guitar   Violin
with modeling harmonics   89.2%   74.3%    65.0%
w/o modeling harmonics    92.7%   79.7%    71.7%

Fig. 12 Observed spectrum (multi-pitch D4 and B♭4 of piano)

Fig. 13 The fundamental frequency distribution and the harmonic structure obtained from Fig. 12 by conventional specmurt (left) and the proposed method (right)

Fig. 14 The harmonic structure of piano (D4)

Fig. 12 shows an observed spectrum of a multi-pitch sound (D4 and B♭4). Fig. 13(a-1) and Fig. 13(a-2) show the fundamental frequencies and the harmonic structure obtained from the observed spectrum by modeling the harmonic structure, and Fig. 13(b-1) and Fig. 13(b-2) show the results obtained by the proposed method. The modeled harmonic structure (Fig. 13(a-2)) has no noise, but the fundamental frequency distribution corresponding to it (Fig. 13(a-1)) is incorrect. Some mistaken peaks of the distribution may be eliminated by threshold processing; however, the larger peak circled in black in Fig. 13(a-1) may not be excluded. On the other hand, the harmonic structure produced by the proposed method (Fig. 13(b-1)) has some small noise, but the fundamental frequency distribution (Fig. 13(b-2)) is correct. The noise in the harmonic structure comes from the difference between the harmonic structures of D4 and B♭4. Since the noise absorbs this difference, the optimal fundamental frequency distribution can be obtained. Fig. 14 shows the harmonic structure of a piano tone (D4). There are some differences between this structure and Fig. 13(a-2), even though Fig. 13(a-2) is the modeled harmonic structure. This may be because it is difficult to model the optimal common harmonic structure in multi-pitch music, since the harmonic structure can vary slightly depending on the pitch. In our future work, we will study how best to obtain the optimal common harmonic structure in multi-pitch situations.

5.
Conclusion

In this paper, we proposed a specmurt-based multi-pitch analysis method that does not model the common harmonic structure. Instead of modeling the structure, the optimal harmonic structure is selected among candidates based on sparseness criteria. The experiments show that our method is effective for multi-pitch analysis. The results in Fig. 10 and Fig. 11 indicate that the optimal parameter α varies depending on the instrument and the music. Since multi-pitch analysis in a real environment deals with various instruments and pitches without instrument information, in our future work we will study how to determine the optimal parameter. We will also improve the method by adding other criteria to avoid octave-difference errors and to make it applicable to vocal singing harmony.

References

[1] L. R. Rabiner: On the use of autocorrelation analysis for pitch detection, IEEE Trans. ASSP, Vol. ASSP-25, No. 1, pp. 24-33, 1977.
[2] D. J. Hermes: Measurement of pitch by subharmonic summation, Journal of ASA, Vol. 83, No. 1, pp. 257-264, 1988.
[3] Y. Takasawa: Transcription with Computer, IPSJ, Vol. 29, No. 6, pp. 593-598, 1988.
[4] P. Cuadra, A. Master and C. Sapp: Efficient pitch detection techniques for interactive music, International Computer Music Conference, 2001.
[5] T. Miwa, Y. Tadokoro and T. Saito: The pitch estimation of different musical instruments sounds using comb filters for transcription, IEICE Trans. D-II, Vol. J81-D-II, No. 9, pp. 1965-1974, 1998.
[6] K. Kashino, K. Nakadai, T. Kinoshita and H. Tanaka: Organization of hierarchical perceptual sounds: Music scene analysis with autonomous processing modules and a quantitative information integration mechanism, Proc. IJCAI, Vol. 1, pp. 158-164, 1995.
[7] K. Kashino, T. Kinoshita, K. Nakadai and H. Tanaka: Chord recognition mechanisms in the OPTIMA processing architecture for music scene analysis, IEICE Trans. D-II, Vol.
J79-D-II, No. 11, pp. 1762-1770, 1996.
[8] A. Klapuri, T. Virtanen and J. Holm: Robust multipitch estimation for the analysis and manipulation of polyphonic musical signals, Proc. COST-G6 Conference on Digital Audio Effects, pp. 233-236, 2000.
[9] T. Virtanen and A. Klapuri: Separation of harmonic sounds using linear models for the overtone series, Proc. ICASSP2002, Vol. 2, pp. 1757-1760, 2002.
[10] M. Goto: F0 estimation of melody and bass lines in musical audio signals, IEICE Trans. D-II, Vol. J84-D-II, No. 1, pp. 12-22, 2001.
[11] M. Goto: A real-time music scene description system: Predominant-F0 estimation for detecting melody and bass lines in real-world audio signals, ISCA Journal, Vol. 43, No. 4, pp. 311-329, 2004.
[12] K. Miyamoto, H. Kameoka, T. Nishino, N. Ono and S. Sagayama: Harmonic, temporal and timbral unified clustering for multi-instrumental music signal analysis, IPSJ SIG Technical Report, 2005-MUS, Vol. 82, pp. 71-78, 2005.
[13] H. Kameoka, J. Le Roux, N. Ono and S. Sagayama: Harmonic temporal structured clustering: A new approach to CASA, ASJ, Vol. 36, No. 7, pp. 575-580, 2006.
[14] K. Miyamoto, H. Kameoka, T. Nishimoto, N. Ono and S. Sagayama: Harmonic-temporal-timbral clustering (HTTC) for the analysis of multi-instrument polyphonic music signals, Proc. ICASSP2008, pp. 113-116, 2008.
[15] K. Takahashi, T. Nishimoto and S. Sagayama: Multi-pitch analysis using deconvolution of log-frequency spectrum, IPSJ SIG Technical Report, 2003-MUS, Vol. 127, pp. 113-116, 2008.
[16] S. Sagayama, K. Takahashi, H. Kameoka and T. Nishino: Specmurt analysis: A piano-roll-visualization of polyphonic music signal by deconvolution of log-frequency spectrum, Proc. ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing (SAPA2004), 2004.
[17] H. Kameoka, S. Saito, T. Nishino and S.
Sagayama: Recursive estimation of quasi-optimal common harmonic structure pattern for specmurt analysis: Piano-roll visualization and MIDI conversion of polyphonic music signal, IPSJ SIG Technical Report, 2004-MUS, Vol. 84, pp. 41-48, 2004.
[18] S. Saito, H. Kameoka, T. Nishimoto and S. Sagayama: Specmurt analysis of multi-pitch music signals with adaptive estimation of common harmonic structure, Proc. International Conference on Music Information Retrieval (ISMIR2005), pp. 84-91, 2005.
[19] S. Saito, H. Kameoka, N. Ono and S. Sagayama: POCS-based common harmonic structure estimation for specmurt analysis, IPSJ SIG Technical Report, 2006-MUS, Vol. 45, pp. 13-18, 2006.
[20] S. Saito, H. Kameoka, N. Ono and S. Sagayama: Iterative multipitch estimation algorithm for MAP specmurt analysis, IPSJ SIG Technical Report, 2006-MUS, Vol. 90, pp. 85-92, 2006.
[21] S. Saito, H. Kameoka, K. Takahashi, T. Nishimoto and S. Sagayama: Specmurt analysis of polyphonic music signals, IEEE Trans. ASLP, Vol. 16, No. 3, pp. 639-650, 2008.

Daiki Nishimura received his B.E. degree in computer science from Kobe University in 2011. His current research interests include acoustic signal processing. He is a member of ASJ.

Toru Nakashika received his B.E. and M.E. degrees in computer science from Kobe University in 2009 and 2011, respectively, and in the same year continued his research as a doctoral student. From September 2011 to August 2012 he studied at INSA de Lyon in France. He is currently a second-year doctoral student at Kobe University. His research interests are speech and image recognition and statistical signal processing. He is a member of IEEE and ASJ.

Tetsuya Takiguchi received his B.S. degree in applied mathematics from Okayama University of Science, Okayama, Japan, in 1994, and his M.E. and Dr. Eng. degrees in information science from Nara Institute of Science and Technology, Nara, Japan, in 1996 and 1999, respectively.
From 1999 to 2004, he was a researcher at IBM Research, Tokyo Research Laboratory, Kanagawa, Japan. He is currently an Associate Professor at Kobe University. His research interests include statistical signal processing and pattern recognition. He received the Awaya Award from the Acoustical Society of Japan in 2002. He is a member of IEEE, IPSJ and ASJ.

Yasuo Ariki received his B.E., M.E. and Ph.D. degrees in information science from Kyoto University in 1974, 1976 and 1979, respectively. He was an Assistant Professor at Kyoto University from 1980 to 1990, and stayed at Edinburgh University as a visiting academic from 1987 to 1990. From 1990 to 1992 he was an Associate Professor and from 1992 to 2003 a Professor at Ryukoku University. Since 2003 he has been a Professor at Kobe University. He is mainly engaged in speech and image recognition and is interested in information retrieval and databases. He is a member of IEEE, IPSJ, JSAI, ITE and IIEEJ.

(Received July 17, 2012; revised January 7, 2013)