Tabla Strokes Recognition

Transcription

Mihir Sarkar
1
Tabla?
The tabla is a pair of hand drums from North India. They are played
with the fingers and palms of both hands.
The right drum (from the player’s perspective) produces a high-pitched
sound that can be tuned to match the rest of the ensemble. The left
drum produces bass sounds with varying pitches depending on the
pressure applied on the drumhead with the palm of the hand.
The tabla can play a variety of different sounds, both pitched and
unpitched. Each of these sounds, called bol, has a syllable associated
with it. Thus rhythmic compositions can be sung, and transmitted in an
oral tradition.
2
Context
• Can you distinguish different bols?
• Can a machine automatically classify tabla
strokes?
• Is there a systematic way to identify the
best method to recognize tabla strokes?
Humans can distinguish different bols after some minimal training.
How can a machine classify tabla strokes, and how would it compare
with a human?
In this work, I study a systematic way to recognize tabla strokes by
trying out various parameters to select the optimum features for
classification.
Tabla stroke recognition is interesting both for music information
retrieval in large multimedia databases and for automatic transcription
of tabla performances.
3
Vision
In my case, this work falls within the larger vision of a real-time online music
collaboration system for Indian percussion: a system with which two tabla players can
play with each other over a network. The computer recognizes the tabla strokes at one
end and sends the recognized bols as symbols, instead of an audio stream, to the
other end, which then synthesizes the appropriate tabla sounds. This minimizes
transmission delay. Algorithmic delays can be compensated with a prediction system
and additional machine intelligence.
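To give a sense of the bandwidth saving from sending symbols instead of audio, here is a minimal Python sketch of a hypothetical stroke-event message (the field layout, names, and sizes are my own illustration, not the system’s actual protocol): a timestamp, a bol identifier, and a velocity packed into 7 bytes, versus roughly 88 kB per second for raw 16-bit mono audio at 44.1 kHz.

```python
import struct
import time

def encode_stroke(ts_ms, bol_id, velocity):
    """Pack one stroke event: uint32 timestamp (ms), uint8 bol id,
    uint16 velocity -> 7 bytes total (little-endian)."""
    return struct.pack("<IBH", ts_ms, bol_id, velocity)

# one recognized bol (id 3, e.g. "Ga") with a medium velocity
msg = encode_stroke(int(time.time() * 1000) % 2**32, 3, 512)
print(len(msg))   # → 7 bytes per stroke, versus ~88200 bytes/s of audio
```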
4
Experimental Setup
• 1 tabla set
• 3 tabla players
• 10 bols
• 413 recordings (kept 300)
• Microphone input (studio recording)
• Discrete strokes
I used one tabla set for recordings. 3 players (Manu Gupta, Graham
Grindlay, and myself) played 10 of the most common bols on the tabla,
10 times each. I actually got more recordings, but kept 300 so as to
have equal priors for each class (i.e. each type of stroke). The
recordings were done in a studio environment with low noise. Discrete
strokes were played, as opposed to elaborate rhythmic patterns.
5
Raw data
These are time-domain waveforms of various strokes: Na on the upper
left, Tin (very similar) on the upper right, Ga on the lower left (lower
frequency, played on the left drum), Dha (a combination of Na and Ga)
in the center, and Ka, a dry percussive sound, on the lower right.
6
Spectrogram
The spectrograms give some information about the strokes (each
spectrogram is in the position corresponding to the waveforms
discussed previously): the Na (upper left) has a clear pitch, whereas the
Ka (lower right) looks mostly like noise. It is difficult to make out
detailed features from these graphs, but we can see a clear onset
where all frequencies are present (like white noise), after which the
spectrum stays relatively steady as it dies down.
7
Feature Extraction
• Time domain: ZCR
• Frequency domain: PSD
• Cepstral domain: MFCC
Some possible features include the zero-crossing rate (suggested in
the literature), but I quickly dropped it because the noise on the
recordings, although low, created zero crossings throughout each
frame. In the frequency domain, I considered the Power Spectral
Density (estimated with Welch’s method; other algorithms, such as the
periodogram method, performed similarly, within 3% at most). I also
considered Mel Frequency Cepstrum Coefficients (MFCC), widely used
in music information retrieval and speech recognition.
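The PSD feature extraction can be sketched as follows. The original work was done in Matlab; this is a minimal Python equivalent using SciPy’s Welch estimator, with an assumed 44.1 kHz sampling rate and a synthetic decaying tone standing in for a recorded stroke.

```python
import numpy as np
from scipy.signal import welch

def psd_features(x, fs=44100, nfft=512):
    """Welch PSD of a stroke sample; averaging overlapping
    periodograms smooths out noise in the spectral estimate."""
    _, pxx = welch(x, fs=fs, nperseg=nfft, nfft=nfft)
    # log compression keeps the large dynamic range manageable
    return 10 * np.log10(pxx + 1e-12)

# synthetic 500 ms "stroke": exponentially decaying tone plus low noise
fs = 44100
t = np.arange(int(0.5 * fs)) / fs
x = np.exp(-8 * t) * np.sin(2 * np.pi * 440 * t) \
    + 0.01 * np.random.randn(t.size)
feat = psd_features(x)
print(feat.shape)   # → (257,) : one value per FFT bin up to Nyquist
```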
8
Dataset Selection
• Orthogonal dimensions:
– Instances
– Bols
– Players
• Training / leave-one-out validation
• Testing
The recordings provided me with 3 orthogonal dimensions in my
dataset: each instance of a stroke played by each player, the strokes
themselves, and the players.
I used an 80/20 split of my dataset for training and testing.
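The 80/20 split can be sketched as below (a Python sketch; the actual work was in Matlab, and the feature matrix here is a placeholder). With 300 recordings and equal priors, this yields 240 training and 60 test samples, i.e. 6 test instances per bol.

```python
import numpy as np

def split_80_20(X, y, seed=0):
    """Shuffle the dataset, then hold out 20% for testing.
    Parameter tuning on the training set uses leave-one-out validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = int(0.8 * len(y))
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

X = np.arange(300 * 4).reshape(300, 4)   # placeholder feature vectors
y = np.repeat(np.arange(10), 30)          # 10 bols x 30 instances each
Xtr, ytr, Xte, yte = split_80_20(X, y)
print(len(ytr), len(yte))   # → 240 60
```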
9
Baseline
• Random: 10%
• Human: 87%
• Initial k-NN: 18%
(Welch’s PSD, NFFT = 16, k = 1)
For comparison purposes, the random baseline (with equal priors) gives
a recognition rate of 10%. I was also curious how a human would
actually perform; taking the test myself, I scored 87% on the test
portion of my dataset.
My first run of an automatic classifier gave me 18%, definitely
something to work on.
10
k-NN
I compared PSD (NFFT = 512) with MFCC (20 coefficients) with the
whole set of features (no reduction), and plotted the evidence curve.
MFCC seems to perform optimally for k = 1. Using both MFCC and
PSD vectors does not improve recognition over MFCC alone.
11
k-NN
Then I reduced dimensionality with PCA (to 8 coefficients), and there
MFCC performed worse than PSD. The reason seems to be that MFCC
needs all of its coefficients to represent the data (it spreads the
information out evenly), so a smaller set of features may carry too little
information for recognition purposes.
12
k-NN
With PCA reduction to 16 coefficients, MFCC outperforms PSD for k =
1 but falls quickly for values other than 1. PSD performs more
consistently on this range of k (1 to 10). Therefore I decided to continue
this study by keeping the PSD feature vector, and dropping out MFCC.
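The PCA reduction and k-NN classification used throughout these experiments can be sketched in a few lines. This is a NumPy illustration of the general technique (the original implementation was in Matlab), run here on toy two-class data rather than the tabla features.

```python
import numpy as np

def pca_fit(X, n_components):
    """PCA via SVD on mean-centred data; returns the mean and a
    projection matrix whose columns are the top principal axes."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:n_components].T

def knn_predict(X_train, y_train, X_test, k=1):
    """Plain k-NN: Euclidean distances, then a majority vote
    over the k nearest training labels."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    idx = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(v).argmax() for v in y_train[idx]])

# toy data: two well-separated classes in 20-D, reduced to 8-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 20)), rng.normal(4, 1, (30, 20))])
y = np.array([0] * 30 + [1] * 30)
mu, W = pca_fit(X, 8)
Z = (X - mu) @ W
pred = knn_predict(Z[:-5], y[:-5], Z[-5:], k=1)
print(pred)   # → [1 1 1 1 1]
```

Note how the distance computation compares every test point against every training point; this exhaustive search is the large classification-time cost of k-NN mentioned later.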
13
k-NN
The idea presented previously about MFCC seems to hold: with a PCA
reduction to 4 coefficients, PSD still clearly outperforms MFCC.
14
k-NN
Keeping the PCA reduction at 8 dimensions, I now study the behavior of
the classifier when varying the FFT length used in the PSD
computation. As intuition suggests, the higher the resolution, the better
the classification. However, the rate of improvement drops off after 512
points, so the best trade-off between computational cost and
recognition results looks to be NFFT = 512, which is what I keep from
now on.
15
k-NN
I also study the effect of the frame size on the recognition rate. By
default, I chose a value of 500ms for the duration of the samples. I
imagine that if I increase the sample length, I might get better results by
including more information in the frame. However this graph shows that
the 750ms case performs worse than 500ms. The reason could be that
many strokes have a short duration; thus the samples do not carry
more information after a certain duration, but merely noise. It is
interesting to note that when I reduce the frame length to 250ms, the
recognition rate improves compared to 500ms, suggesting that most of
the information is indeed carried in the first part of the frame. So what if I
further reduce the frame size (to 100 ms)? The results here are
abysmal. The reason I propose is that although most of the strokes’
information is carried in the early part of the frame, the initial onset
carries all frequencies equally (like white noise--see spectrograms).
And only after the onset does the waveform carry discriminatory
information. So we could imagine a system which uses the onset to
detect the stroke, but ignores that part of the waveform, and then
performs recognition on the next couple hundred milliseconds (which is
good news for continuous stroke recognition).
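The onset-skipping idea above can be sketched as follows: detect the onset from the rise in short-time energy, discard the attack, and keep the window that follows. This is a minimal NumPy illustration (not the actual system); the frame and hop sizes, the 10 ms skip, and the synthetic test signal are all assumptions for the sketch.

```python
import numpy as np

def frame_after_onset(x, fs, skip_ms=10, win_ms=250, frame=256, hop=64):
    """Locate the onset as the largest jump in short-time energy,
    skip past the broadband attack, and return the window after it."""
    n = (len(x) - frame) // hop
    energy = np.array([np.sum(x[i * hop:i * hop + frame] ** 2)
                       for i in range(n)])
    onset = int(np.argmax(np.diff(energy))) * hop
    start = onset + int(skip_ms * fs / 1000)
    return x[start:start + int(win_ms * fs / 1000)]

# 1 s synthetic buffer: near-silence, then a decaying tone at t = 0.3 s
fs = 44100
t = np.arange(fs) / fs
x = np.where(t < 0.3,
             0.001 * np.random.randn(t.size),
             np.exp(-6 * (t - 0.3)) * np.sin(2 * np.pi * 200 * (t - 0.3)))
seg = frame_after_onset(x, fs)
print(len(seg))   # → 11025, i.e. 250 ms at 44.1 kHz
```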
16
Confusion Matrix

        Na  Tin  Ga  Ka  Dha  Dhin  Te  Re  Tat  Thun
Na       5    0   0   0    0     0   0   0    0     1
Tin      0    3   0   0    1     0   0   0    0     2
Ga       0    0   4   0    0     2   0   0    0     0
Ka       0    1   0   3    0     1   0   1    0     0
Dha      0    0   2   0    2     1   1   0    0     0
Dhin     0    0   1   0    1     4   0   0    0     0
Te       0    1   0   0    0     0   1   0    4     0
Re       0    0   0   0    0     0   1   4    0     1
Tat      0    0   0   0    0     0   2   1    3     0
Thun     0    1   0   0    0     0   0   0    0     5
The recognition rate on the test dataset with the previous parameters
(k = 1, NFFT = 512, duration = 256ms) is just above 50%. This
confusion matrix (rows are actual classes, columns are estimated
classes) is interesting because it shows which strokes the system
confuses, and these do correlate with the strokes that are difficult for
humans to distinguish.
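Computing a confusion matrix and the overall recognition rate is straightforward; a small NumPy sketch (with toy labels, not the tabla data) follows the same rows-actual / columns-estimated convention used above.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Rows are actual classes, columns are estimated classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion(y_true, y_pred, 3)
print(cm)
# the recognition rate is the trace over the total count
print(np.trace(cm) / cm.sum())   # → 0.666...
```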
17
Neural Networks

Nodes   Validation (%)   Testing (%)
20      53               5
22      47               13
24      13               15
36      41               10
40      55               18
42      70               12
46      40               6.7
48      40               10
50      45               10
52      72               10
I also tried classification using neural nets (1 hidden layer). This gave
very low results compared with k-NN. The reason could be that the
number of data points is too low for training.
18
Contributions
• Implemented pattern classification
algorithms (Matlab)
• Analyzed recognition rates with
varying parameters
• Explored a systematic way to perform
classification
The best method so far is k-NN with a recognition rate of around 50%.
This could be improved further by including other time and frequency
features, and applying FLD instead of PCA. The downside to k-NN is
the large computation time, which may preclude it from any real-time
system.
19
Future Directions
• Vibration sensors
• More recordings
• Timing (multiple frames, HMM)
• Real-time
• Continuous strokes
• Integrate context (rhythmic patterns)
As far as future directions are concerned, I would like to continue
exploring my dataset with other classifiers, such as Multi-Linear
Analysis, and to use automated tools for labeling and stroke extraction
(based on automatic onset detection, for instance). As far as the big
picture goes, I would use FSR (force-sensing resistor) sensors instead
of a microphone (requiring a new set of recordings) in order to minimize
the feedback (which may cause false alarms) between a speaker
playing tabla sounds and the actual tabla. I also plan to explore real-
time recognition of continuous tabla strokes, and to integrate context
information such as the rhythmic pattern being played, which may
affect the priors of each stroke after each beat.
20