Tabla Strokes Recognition
Mihir Sarkar

1 Tabla?

The tabla is a pair of hand drums from North India, played with the fingers and palms of both hands. The right drum (from the player's perspective) produces a high-pitched sound whose pitch can be tuned to the rest of the orchestra. The left drum produces bass sounds with varying pitches depending on the pressure the palm applies to the drumhead. The tabla can play a variety of different sounds, both pitched and unpitched. Each of these sounds, called a bol, has a syllable associated with it. Thus rhythmic compositions can be sung, and transmitted in an oral tradition.

2 Context

• Can you distinguish different bols?
• Can a machine automatically classify tabla strokes?
• Is there a systematic way to identify the best method to recognize tabla strokes?

Humans can distinguish different bols after some minimal training. How can a machine classify tabla strokes, and how would it compare with a human? In this work, I study a systematic way to recognize tabla strokes by trying out various parameters to select the optimum features for classification. Tabla stroke recognition is interesting both for music information retrieval in large multimedia databases and for automatic transcription of tabla performances.

3 Vision

This work falls within the larger vision of a real-time online music collaboration system for Indian percussion: a system with which two tabla players can play with each other over a network. The computer recognizes the tabla strokes at one end and sends the recognized bols as symbols, instead of an audio stream, to the other end, which then synthesizes the appropriate tabla sounds. This minimizes transmission delay. Algorithmic delays can be compensated with a prediction system and additional machine intelligence.

4 Experimental Setup

• 1 tabla set
• 3 tabla players
• 10 bols
• 413 recordings (kept 300)
• Microphone input (studio recording)
• Discrete strokes

I used one tabla set for the recordings. Three players (Manu Gupta, Graham Grindlay, and myself) played 10 of the most common bols on the tabla, 10 times each. I actually collected more recordings, but kept 300 so as to have equal priors for each class (i.e., each type of stroke). The recordings were made in a studio environment with low noise. Discrete strokes were played, as opposed to elaborate rhythmic patterns.

5 Raw data

These are time-domain waveforms of various strokes: Na on the upper left, Tin (very similar) on the upper right, Ga (lower frequency, played on the left drum) on the lower left, Dha (a combination of Na and Ga) in the center, and Ka, a dry percussive sound, on the lower right.

6 Spectrogram

The spectrograms give some information about the strokes (each spectrogram is at the position corresponding to the waveform discussed previously): the Na (upper left) has a clear pitch, whereas the Ka (lower right) looks mostly like noise. But it is difficult to make out actual features from these graphs. However, we can see a clear onset where all frequencies are present (white noise), after which the spectrum remains relatively steady as it dies down.

7 Feature Extraction

• Time domain: ZCR
• Frequency domain: PSD
• Cepstral domain: MFCC

Among the possible features is the zero-crossing rate (based on the literature), but I quickly dropped it because the noise on the recordings, although low, created zero crossings throughout each frame. In the frequency domain, I considered the power spectral density (Welch's method; other algorithms, such as the periodogram method, performed similarly, with a 3% variance at most). I also considered Mel-frequency cepstrum coefficients (MFCC), widely used for music information retrieval and speech recognition.
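As a concrete illustration of these three feature families, here is a minimal sketch in Python. The original implementation was in Matlab; scipy, librosa, and all function and parameter names below are my assumptions, not the author's code.

    import numpy as np
    from scipy.signal import welch
    import librosa

    def extract_features(frame, sr, nfft=512, n_mfcc=20):
        """Compute the three candidate features for one stroke frame."""
        # Time domain: fraction of consecutive samples whose sign changes.
        # Dropped in the study because even low-level noise flips the sign
        # of the signal throughout the frame.
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # Frequency domain: power spectral density via Welch's method.
        _, psd = welch(frame, fs=sr, nperseg=nfft, nfft=nfft)
        # Cepstral domain: MFCCs, averaged over time for a fixed-length vector.
        mfcc = librosa.feature.mfcc(y=frame, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
        return zcr, psd, mfcc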
8 Dataset Selection

• Orthogonal dimensions:
  – Instances
  – Bols
  – Players
• Training / leave-one-out validation
• Testing

The recordings provided three orthogonal dimensions in my dataset: the instances of each stroke played by each player, the strokes themselves, and the players. I considered an 80/20% break-up of my dataset for training and testing.

9 Baseline

• Random: 10%
• Human: 87%
• Initial k-NN: 18% (Welch's PSD, NFFT = 16, k = 1)

For comparison purposes, the random baseline (with equal priors) gives a recognition rate of 10%. I was also curious how a human would actually perform; I took the test myself and scored 87% on the test portion of my dataset. My first run of an automatic classifier gave 18%, definitely something to work on.

10 k-NN

I compared PSD (NFFT = 512) with MFCC (20 coefficients), using the whole set of features (no reduction), and plotted the evidence curve. MFCC seems to perform optimally for k = 1. Using both the MFCC and PSD vectors does not improve recognition over MFCC alone.

11 k-NN

Then I reduced dimensionality with PCA (to 8 coefficients), and there MFCC performed worse than PSD. The reason seems to be that MFCC requires all of its coefficients to represent the data (it spreads the information out evenly), so a smaller set of features may carry too little information for recognition purposes.

12 k-NN

With PCA reduction to 16 coefficients, MFCC outperforms PSD for k = 1 but falls off quickly for other values of k, while PSD performs more consistently over this range of k (1 to 10). I therefore decided to continue the study with the PSD feature vector and to drop MFCC.

13 k-NN

The idea presented previously about MFCC seems to hold: with a PCA reduction to 4 coefficients, PSD still clearly outperforms MFCC.

14 k-NN

With a fixed PSD length of 512 and a PCA reduction to 8 dimensions, I now study the behavior of the classifier when varying the FFT length used in the PSD computation. As seems intuitively acceptable, the higher the resolution, the better the classification. However, the rate of improvement drops off after 512 points, so the best trade-off between computational requirements and recognition results looks to be NFFT = 512, which is what I keep from here on.

15 k-NN

I also study the effect of the frame size on the recognition rate. By default, I chose a duration of 500 ms for the samples. I imagined that increasing the sample length might give better results by including more information in the frame, but the graph shows that the 750 ms case performs worse than 500 ms. The reason could be that many strokes have a short duration; beyond a certain point the samples carry no more information, merely noise. Interestingly, reducing the frame length to 250 ms improves the recognition rate compared to 500 ms, suggesting that most of the information is indeed carried in the first part of the frame. So what if I reduce the frame size further, to 100 ms? The results there are abysmal. The reason I propose is that although most of a stroke's information is carried early in the frame, the initial onset excites all frequencies equally (like white noise; see the spectrograms), and only after the onset does the waveform carry discriminatory information. So we could imagine a system that uses the onset to detect the stroke but ignores that part of the waveform, and then performs recognition on the next couple hundred milliseconds (which is good news for continuous stroke recognition).
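Pulling the retained choices together (PSD features, PCA to 8 dimensions, k-NN with k = 1, 80/20 split), here is a minimal sketch of the pipeline, using scikit-learn as a stand-in for the original Matlab implementation; the function and its name are mine, not the author's.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.decomposition import PCA
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    def classify_bols(X, y, n_components=8, k=1):
        """X: one PSD vector (NFFT = 512) per recording; y: bol labels."""
        # 80/20 split, stratified to keep the dataset's equal class priors.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        pca = PCA(n_components=n_components).fit(X_tr)
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(pca.transform(X_tr), y_tr)
        y_pred = knn.predict(pca.transform(X_te))
        return accuracy_score(y_te, y_pred), confusion_matrix(y_te, y_pred)

A run of this kind of pipeline yields the recognition rate and confusion matrix discussed next.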
16 Confusion Matrix

With the previous parameters (k = 1, NFFT = 512, duration = 256 ms), the recognition rate on the test dataset is just above 50%. The confusion matrix (rows are actual classes, columns are estimated classes) is interesting because it shows which strokes the system confuses, and these do correlate with the strokes that are difficult to distinguish for humans:

          Na  Tin  Ga  Ka  Dha  Dhin  Te  Re  Tat  Thun
    Na     5    0   0   0    0     0   0   0    0     1
    Tin    0    3   0   0    1     0   0   0    0     2
    Ga     0    0   4   0    0     2   0   0    0     0
    Ka     0    1   0   3    0     1   0   1    0     0
    Dha    0    0   2   0    2     1   1   0    0     0
    Dhin   0    0   1   0    1     4   0   0    0     0
    Te     0    1   0   0    0     0   1   0    4     0
    Re     0    0   0   0    0     0   1   4    0     1
    Tat    0    0   0   0    0     0   2   1    3     0
    Thun   0    1   0   0    0     0   0   0    0     5

17 Neural Networks

    Nodes           20   22   24   36   40   42   46   48   50   52
    Validation (%)  53   47   13   41   55   70   40   40   45   72
    Testing (%)      5   13   15   10   18   12   6.7  10   10   10

I also tried classification using neural networks (one hidden layer). This gave very low results compared with k-NN. The reason could be that the number of data points is too low for training.

18 Contributions

• Implemented pattern classification algorithms (Matlab)
• Analyzed recognition rates with varying parameters
• Explored a systematic way to perform classification

The best method so far is k-NN, with a recognition rate of around 50%. This could be improved further by including other time- and frequency-domain features and by applying FLD instead of PCA. The downside of k-NN is the large computation time, which may preclude it from any real-time system.

19 Future Directions

• Vibration sensors
• More recordings
• Timing (multiple frames, HMM)
• Real-time
• Continuous strokes
• Integrate context (rhythmic patterns)

As far as future directions are concerned, I would like to continue exploring my dataset with other classifiers, such as Multi-Linear Analysis, and also use automated tools for labeling and stroke extraction (based on automatic onset detection, for instance; see the sketch below). As far as the big picture goes, I would use FSR sensors instead of a microphone (requiring a new set of recordings to be made) in order to minimize the feedback (which may cause false alarms) between an audio speaker playing tabla sounds and the actual tabla. I also plan to explore real-time recognition of continuous tabla strokes, and to integrate context information such as the rhythmic patterns being played, which may affect the priors of each stroke after each beat.
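As a closing sketch of the segmentation idea raised in slides 15 and 19: detect an onset from a jump in short-time energy, skip the noise-like attack, and hand the next 250 ms to the classifier. The detector, its thresholds, and all names here are illustrative assumptions, not part of the original work.

    import numpy as np

    def segment_strokes(x, sr, win=512, hop=256, jump=4.0,
                        skip_s=0.02, take_s=0.25):
        """Return one candidate classification window per detected onset."""
        # Short-time energy, one value per hop.
        energy = np.array([np.sum(x[i:i + win] ** 2)
                           for i in range(0, len(x) - win, hop)])
        refractory = int(0.1 * sr / hop)  # ignore re-triggers within 100 ms
        segments, last = [], -refractory
        for f in range(1, len(energy)):
            # An onset is a frame whose energy jumps well above the previous one.
            if energy[f] > jump * (energy[f - 1] + 1e-12) and f - last >= refractory:
                start = f * hop + int(skip_s * sr)  # skip the broadband attack
                segments.append(x[start:start + int(take_s * sr)])
                last = f
        return segments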