Estimation and sample size calculations for matching performance of biometric authentication

Michael E. Schuckers
Department of Mathematics, Computer Science and Statistics
St. Lawrence University, Canton, NY 13617 USA
and
Center for Identification Technology Research (CITeR)
Abstract
Performance of biometric authentication devices can be measured in a variety of
ways. The most common way is by calculating the false accept and false reject
rates, usually referred to as FAR and FRR, respectively. In this paper we present
two methodologies for creating confidence intervals for matching error rates. The
approach that we take is based on a general parametric model. Because we utilize
a parametric model, we are able to ‘invert’ our confidence intervals to develop
appropriate sample size calculations that account for both number of attempts per
person and number of individuals to be tested, a first for biometric authentication
testing. The need for sample size calculations that achieve this is currently acute in
biometric authentication. These methods are approximate and their small sample
performance is assessed through a simulation study. The distribution we use for
simulating data is one that arises repeatedly in actual biometric tests.
Key words: False match rate, false non-match rate, intra-individual correlation,
logit transformation, Beta-binomial distribution, confidence intervals, Monte Carlo
simulation, sample size calculations
1991 MSC: 62F11, 62F25, 62N99
Email address: [email protected] (Michael E. Schuckers).
This research was made possible by generous funding from the Center for Identification Technology Research (CITeR) at West Virginia University and by NSF
grant CNS-0325640, which is cooperatively funded by the National Science Foundation
and the United States Department of Homeland Security.
Preprint submitted to Pattern Recognition
21 November 2005
1 Introduction
Biometric authentication devices or biometric authenticators aim to match a
presented physiological image against one or more stored physiological images.
The matching performance of a biometric authenticator (BA) is an important
aspect of its overall performance. Since each matching decision from a BA
results in either a “reject” or an “accept”, the two most common measures of
performance are the false accept rate and the false reject rate. These rates are
often abbreviated as FAR and FRR, respectively, and the estimation of these
quantities is the focus of this paper. The methods described herein are equally
applicable for false match rates and false non-match rates since data for these
summaries are collected in a similar manner. As acknowledged in a variety of
papers, including [1], [2], and more recently by [3], there is an ongoing need
for assessing the uncertainty in the estimation of these error rates for a BA.
In particular there is an immense need for tools that assess the uncertainty
in estimation via confidence intervals (CI’s) and for sample size calculations
grounded in such intervals. Along these lines, the question of how many individuals to test is particularly difficult because biometric devices are generally
tested on multiple individuals and each individual is tested multiple times. In
this paper we present two CI methodologies for estimating the matching error
rates for a BA. We then ‘invert’ the better performing of these CI’s to create
sample size calculations for error rate estimation.
Several methods for the estimation of FAR and FRR appear in the biometrics literature. These generally fall into two approaches: parametric and nonparametric. Among the parametric approaches, [4] uses the binomial distribution to make confidence intervals and obtain sample size calculations, while [5]
models the features by a multivariate Gaussian distribution to obtain estimates of the error rates. The approach of [6] is to assume that individual
error rates follow a Gaussian distribution. In a similar vein, [7] assumes a
Beta distribution for the individual error rates and uses maximum likelihood
estimation to create CI’s. Several non-parametric approaches have also been
considered. [8] outlined exact methods for estimating the error rates for binomial data as well as for estimating the FAR when cross-comparisons are used.
Two resampling methods have been proposed. They are the ‘block bootstrap’
or ‘subsets bootstrap’ of [1] and the balanced repeated replicates approach
of [9]. It is worth noting that non-parametric methods do not allow for sample size calculations since it is not possible to ‘invert’ these calculations. The
approach taken in this paper will be a parametric one that allows for sample
size calculations.
In addition to CI methods, several approximate methods have been proposed
for sample size calculations. “Doddington’s Rule” [10] states that one should
collect data until there are 30 errors. Likewise the “Rule of 3” [2] is that
3/(the number of attempts) is an appropriate upper bound for a 95% CI for
the overall error rate when zero errors are observed. However, both of these
methods make use of the binomial distribution which is often an unacceptable
choice for biometric data [8]. [6] developed separate calculations for the number
of individuals to test and for the number of tests per individual. Here we
propose a single formula that accounts for both.
In this paper we propose two methods for estimating the error rates for BA’s.
The first of these methods is a simplification of the methodology based on
maximum likelihood estimation that is found in [7]. The advantage of the
approach taken here is that it does not depend on numerical maximization
methods. The second is a completely new method based on a transformation. For both methods we present simulations to describe how these methods
perform under a variety of simulated conditions. In order to validate the simulation strategy we show that the versatile distribution used in our simulation
is a ‘good’ fit for real data collected on BA’s. Finally we develop sample size
calculations based on the second methodology since that method performed
better. The paper is organized in the following manner. Section 2 describes
a general model formulated by [11] for dealing with overdispersion in binary
data. There we also introduce this model and the notation we will use throughout the paper. Section 3 introduces our CI methodologies based on this model.
Results from a simulation study of the performance of these CI’s are found in
Section 4. In that section, we also present an argument for the validity and
applicability of the simulated data. Section 5 derives sample size calculations
based on the second model. Finally, Section 6 contains a discussion of the
results presented here and their implications.
2 Extravariation Model
As mentioned above, several approaches to modelling error rates from a BA
have been developed. In order to develop sample size calculations, we take a
flexible parametric approach. Previously [7] presented an extravariation model
for estimating FARs and FRRs based on the Beta-binomial distribution which
assumes that error rates for each individual follow a Beta distribution. Here
we follow [11] in assuming the first two moments of a Beta-binomial model
but we do not utilize the assumptions of the shape of the Beta distribution for
individual error rates found in [7]. We also replace the numerical estimation
methods therein with closed form calculations. Details of this model are given
below.
We begin by assuming an underlying population error rate, either FAR or
FRR, of π. Following [2], let n be the number of comparison pairs tested and
let mi be the number of decisions made regarding the ith comparison pair with
i = 1, 2, . . . , n. We define a comparison pair broadly to encompass any measurement of a biometric image to another image, of an image to a template or
of a template to a template. This enables us to model both FAR and FRR in
the same manner. Then for the ith comparison pair, let Xi represent the observed number of incorrect decisions from the mi attempts and let $p_i = m_i^{-1} X_i$
represent the observed proportion of errors from mi observed decisions from
the ith comparison pair. We assume that the Xi’s are conditionally independent given mi, n, π and ρ. Then,
$$
\begin{aligned}
E[X_i \mid \pi, \rho, m_i] &= m_i \pi \\
Var[X_i \mid \pi, \rho, m_i] &= m_i \pi (1 - \pi)(1 + (m_i - 1)\rho)
\end{aligned}
\tag{1}
$$
where ρ is a term representing the degree of extravariation in the model. The
assumption of conditional independence is the same one that is implicit in
the ‘subset bootstrap’ of [1]. The ρ found in (1) is often referred to as the
intra-class correlation coefficient, see e.g. [12] or [13]. Here we will refer to it
as the intra-comparison correlation. The model in (1) reduces to the binomial
if ρ = 0 or mi = 1 for all i and, thus, (1) is a generalization of the binomial
that allows for within comparison correlation.
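To give a sense of scale for the extravariation term in (1), the short calculation below uses illustrative values chosen only for this example (m_i = 10, π = 0.05, ρ = 0.2); they are not taken from any of the data discussed in this paper.

```latex
\begin{aligned}
\rho = 0 \ (\text{binomial}): \quad & Var[X_i] = 10(0.05)(0.95) = 0.475 \\
\rho = 0.2: \quad & Var[X_i] = 0.475\,\bigl(1 + (10 - 1)(0.2)\bigr) = 0.475 \times 2.8 = 1.33
\end{aligned}
```

Under these assumed values a moderate intra-comparison correlation nearly triples the variance of the error count, which is why a binomial interval would tend to be too narrow for such data.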
3 Confidence Intervals
The previous section introduced notation for the extravariation model. Here
we use this model for estimating an error rate π. Suppose that we have n
observed Xi ’s from a test of a biometric authentication device. We can then
use that data to estimate the parameters of our model. Let
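$$
\begin{aligned}
\hat{\pi} &= \left( \sum_{i=1}^{n} m_i \right)^{-1} \sum_{i=1}^{n} X_i \\
\hat{\rho} &= \frac{\sum_{i=1}^{n} \left[ X_i (X_i - 1) - 2\hat{\pi}(m_i - 1) X_i + m_i (m_i - 1)\hat{\pi}^2 \right]}{\sum_{i=1}^{n} m_i (m_i - 1)\hat{\pi}(1 - \hat{\pi})}
\end{aligned}
\tag{2}
$$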
This estimation procedure for ρ is given by [14], while here we use a traditional
unbiased estimator of π.
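The estimators in (2) are closed form, so they can be computed in a few lines. The following is a minimal sketch, not the author's code; the function name `moment_estimates` and the toy data are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence, Tuple


def moment_estimates(x: Sequence[int], m: Sequence[int]) -> Tuple[float, float]:
    """Return (pi_hat, rho_hat) as in (2).

    x[i] is the number of errors observed for comparison pair i,
    m[i] is the number of decisions made for that pair.
    """
    if len(x) != len(m) or len(x) == 0:
        raise ValueError("x and m must be non-empty and of equal length")
    pi_hat = sum(x) / sum(m)
    # Numerator of rho_hat: sum of the bracketed terms in (2).
    num = sum(
        xi * (xi - 1) - 2 * pi_hat * (mi - 1) * xi + mi * (mi - 1) * pi_hat ** 2
        for xi, mi in zip(x, m)
    )
    # Denominator of rho_hat; zero when pi_hat is 0 or 1, or when every m[i] is 1.
    den = sum(mi * (mi - 1) * pi_hat * (1 - pi_hat) for mi in m)
    if den == 0:
        raise ValueError("rho_hat is undefined for these data")
    return pi_hat, num / den


# Toy data (hypothetical): 5 comparison pairs with 5 decisions each.
x = [0, 1, 2, 0, 1]
m = [5, 5, 5, 5, 5]
pi_hat, rho_hat = moment_estimates(x, m)
print(pi_hat, rho_hat)
```

Note that the moment estimator of ρ can be negative for a given sample, as it is for this toy data set; the sketch returns the value as computed rather than truncating it.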
3.1 Traditional Confidence Interval
We simplify the approach of [7] by replacing the maximum likelihood estimates
with a moments-based approach. Thus we have an estimate, π̂, of the error
rate, π, and the intra-comparison correlation, ρ̂. We use these to evaluate the
standard error of π̂ following (1) assuming that the image pairs tested are
conditionally independent of each other. The estimated variance of π̂ is then
$$
\hat{V}[\hat{\pi}] = \hat{V}\left[ \frac{\sum X_i}{\sum m_i} \right]
= \left( \sum m_i \right)^{-2} \sum \hat{V}[X_i]
\approx \frac{\hat{\pi}(1 - \hat{\pi})(1 + (m_0 - 1)\hat{\rho})}{\sum m_i}
\tag{3}
$$
where
$$
m_0 = \bar{m} - \frac{\sum_{i=1}^{n} (m_i - \bar{m})^2}{\bar{m}\, n (n - 1)}
\tag{4}
$$
and $\bar{m} = n^{-1} \sum_{i=1}^{n} m_i$. Note that in the notation of [7], (1 + (m0 − 1)ρ) = C.
We can create a nominally 100 × (1 − α)% CI for π from this. Using the results
in [15] about the sampling distribution of π̂, we get the following interval
$$
\hat{\pi} \pm z_{1-\alpha/2} \left[ \frac{\hat{\pi}(1 - \hat{\pi})(1 + (m_0 - 1)\hat{\rho})}{\sum m_i} \right]^{1/2}
\tag{5}
$$
where $z_{1-\alpha/2}$ represents the 100 × (1 − α/2)th percentile of a Gaussian distribution.
Our use of the Gaussian or Normal distribution is justified by the asymptotic
properties of these estimators [14, 15].
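As an illustration of (4) and (5), the sketch below computes the traditional interval from π̂, ρ̂ and the list of m_i. It is a hedged example rather than a reference implementation; the names `m_zero` and `traditional_ci` are hypothetical, and the inputs are the reported hand geometry FRR estimates (π̂ = 0.0520, ρ̂ = 0.0000, n = 50, m = 10).

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple


def m_zero(m: Sequence[int]) -> float:
    """m_0 from (4); equals the common value when all m[i] are equal."""
    n = len(m)
    m_bar = sum(m) / n
    if n < 2:
        return m_bar
    return m_bar - sum((mi - m_bar) ** 2 for mi in m) / (m_bar * n * (n - 1))


def traditional_ci(
    pi_hat: float, rho_hat: float, m: Sequence[int], alpha: float = 0.05
) -> Tuple[float, float]:
    """Nominal 100(1 - alpha)% interval from (5)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    var = pi_hat * (1 - pi_hat) * (1 + (m_zero(m) - 1) * rho_hat) / sum(m)
    half_width = z * math.sqrt(var)
    return pi_hat - half_width, pi_hat + half_width


# Hand geometry FRR estimates as reported: 50 comparison pairs, 10 decisions each.
lo, hi = traditional_ci(0.0520, 0.0, [10] * 50)
print(round(lo, 4), round(hi, 4))
```

With these inputs the endpoints should agree, up to rounding, with the Hand FRR row of Table 1. Nothing in (5) prevents the lower endpoint from falling below zero when π̂ is small.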
3.2 Logit Confidence Interval
One of the traditional difficulties with estimation of proportions near zero
(or one) is that sampling distributions of the estimated proportions are non-Gaussian. Another problem is that CI’s for proportions, such as that given in
(5), are not constrained to fall within the interval (0, 1). The latter is specifically noted by [2]. One method that has been used to compensate for both of
these is to transform the proportions to another scale. Many transformations
for proportions have been proposed including the logit, probit and arcsin of
the square root, e.g. [14]. [16] has an extensive discussion of transformed
CI’s of the kind that we are proposing here. Below we use the logit or log-odds transformation to create CI’s for the error rate π. [17] offers a specific
discussion of CI’s based on a logit transformation for binomial proportions.
Table 1
Sample Confidence Intervals Based on the Traditional Approach
                                            Confidence Interval
Modality  Rate     n   m      π̂       ρ̂   Lower Endpoint  Upper Endpoint
Hand      FAR   2450   5  0.0637  0.0621          0.0588          0.0685
Finger    FAR   2450   5  0.0589  0.0021          0.0548          0.0631
Face      FAR   2450   5  0.0189  0.0066          0.0158          0.0206
Hand      FRR     50  10  0.0520  0.0000          0.0325          0.0715
Finger    FRR     50  10  0.0420  0.0666          0.0198          0.0642
Face      FRR     50  10  0.0300  0.0759          0.0106          0.0494
The logit or log-odds transformation is one of the most commonly used transformations in statistics. For this reason, we focus on the logit over other
transformations. Define $\mathrm{logit}(\pi) \equiv \ln\left(\frac{\pi}{1-\pi}\right)$ as the natural logarithm of the
odds of an error occurring. The logit function has a domain of (0, 1) and a
range of (−∞, ∞). One advantage of using the logit transformation is that
we move from a bounded parameter space, π ∈ (0, 1), to an unbounded one,
logit(π) ∈ (−∞, ∞). Thus, our approach is as follows. We first transform our
estimand, π, and our estimator, π̂, to γ ≡ logit(π) and γ̂ ≡ logit(π̂), respectively. Next, we create a CI for γ using an approximation to the standard error
of γ̂. Finally, we invert the endpoints of that interval back to the original scale.
Letting $\mathrm{ilogit}(\gamma) \equiv \mathrm{logit}^{-1}(\gamma) = \frac{e^{\gamma}}{1+e^{\gamma}}$, we can create a 100(1 − α)% CI using
γ̂. To do this we use a Delta method expansion for the estimated standard
error of γ̂. (The Delta method, as it is known in the statistical literature, is
simply a one step Taylor series expansion of the variance. See [14] for more
details.) Then our CI on the transformed scale is
$$
\hat{\gamma} \pm z_{1-\alpha/2} \left( \frac{1 + (m_0 - 1)\hat{\rho}}{\hat{\pi}(1 - \hat{\pi})\,\bar{m} n} \right)^{1/2}
\tag{6}
$$
where $\bar{m} = \frac{1}{n} \sum_{i=1}^{n} m_i$. Thus (6) gives a CI for γ = logit(π) and we will refer to
the endpoints of this interval as γL and γU for lower and upper respectively.
The final step for making a CI for π is to take the ilogit of both endpoints
of this interval which results in (ilogit(γL), ilogit(γU)). Thus an approximate
(1 − α) × 100% CI for π is
$$
(\mathrm{ilogit}(\gamma_L),\ \mathrm{ilogit}(\gamma_U)).
\tag{7}
$$
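The standard error appearing in (6) can be traced in a single step. The following is a sketch of the Delta method calculation, using only the variance approximation in (3) and the derivative of the logit; it is offered as a reading aid rather than as the derivation given in [14].

```latex
\frac{d}{d\pi}\,\mathrm{logit}(\pi) = \frac{1}{\pi(1-\pi)}
\quad\Longrightarrow\quad
\hat{V}[\hat{\gamma}]
\approx \left( \frac{1}{\hat{\pi}(1-\hat{\pi})} \right)^{2} \hat{V}[\hat{\pi}]
= \frac{1}{\hat{\pi}^{2}(1-\hat{\pi})^{2}} \cdot
  \frac{\hat{\pi}(1-\hat{\pi})\,(1 + (m_0 - 1)\hat{\rho})}{\sum m_i}
= \frac{1 + (m_0 - 1)\hat{\rho}}{\hat{\pi}(1-\hat{\pi})\,\bar{m} n}
```

The last equality uses $\sum m_i = \bar{m} n$, and taking the square root gives the term multiplied by the Gaussian percentile in (6).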
The interval, (7), is asymmetric because the logit is not a linear transformation. This differs from traditional CI’s, which are plus or minus a margin of
error. However, this interval has the same properties as other CI’s. (See [18]
for a rigorous definition of a CI.) In addition, this interval is guaranteed to
fall inside the interval (0, 1) as long as at least one error is observed. In the
next section we focus on how well the CI’s found in (6) and (7) perform for
reasonable values of π, ρ, m, and n.
Table 2
Sample Confidence Intervals Based on the Logit Approach
                                            Confidence Interval
Modality  Rate     n   m      π̂       ρ̂   Lower Endpoint  Upper Endpoint
Hand      FAR   2450   5  0.0637  0.0621          0.0590          0.0687
Finger    FAR   2450   5  0.0589  0.0021          0.0549          0.0633
Face      FAR   2450   5  0.0189  0.0066          0.0160          0.0208
Hand      FRR     50  10  0.0520  0.0000          0.0356          0.0753
Finger    FRR     50  10  0.0420  0.0666          0.0246          0.0708
Face      FRR     50  10  0.0300  0.0759          0.0156          0.0568
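The three steps of the logit interval (transform, build the interval in (6), back-transform as in (7)) can be sketched as follows. This is an illustrative example under the assumption that all m_i are equal unless `m0` is supplied; the function names are hypothetical, and the inputs are the reported hand geometry FRR estimates.

```python
import math
from statistics import NormalDist
from typing import Optional, Tuple


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def ilogit(g: float) -> float:
    # Algebraically equal to e^g / (1 + e^g).
    return 1 / (1 + math.exp(-g))


def logit_ci(
    pi_hat: float,
    rho_hat: float,
    m_bar: float,
    n: int,
    m0: Optional[float] = None,
    alpha: float = 0.05,
) -> Tuple[float, float]:
    """Nominal 100(1 - alpha)% interval from (6) and (7)."""
    if not 0 < pi_hat < 1:
        raise ValueError("the logit interval needs 0 < pi_hat < 1")
    if m0 is None:
        m0 = m_bar  # m_0 equals m-bar when every m_i is the same
    z = NormalDist().inv_cdf(1 - alpha / 2)
    se = math.sqrt((1 + (m0 - 1) * rho_hat) / (pi_hat * (1 - pi_hat) * m_bar * n))
    gamma_hat = logit(pi_hat)
    return ilogit(gamma_hat - z * se), ilogit(gamma_hat + z * se)


# Hand geometry FRR estimates as reported: n = 50 pairs, m = 10 decisions each.
lo, hi = logit_ci(0.0520, 0.0, m_bar=10, n=50)
print(round(lo, 4), round(hi, 4))
```

With these inputs the endpoints should agree, up to rounding, with the Hand FRR row of Table 2. The explicit check on π̂ reflects the condition stated above that at least one error must be observed.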
3.3 Examples
To illustrate these methods we present example CI’s for both of the CI methods
given above. Results for the traditional approach and the logit approach are
found in Tables 1 and 2, respectively. The data used for these intervals come
from [19]. In that paper the authors investigated data from three biometric
modalities – face, fingerprint and hand geometry – and recorded the match
scores for ten within-individual image pairs of 50 people and for five between-individual
image pairs for those same 50 individuals. Note that the between-individual
cross comparisons here are not symmetric and thus there were 49 ×
50 = 2450 comparison pairs in the sense we are using here. Thus there are 500
decisions comparing an individual to themselves and 12250 decisions comparing
an individual to another individual. Several things are apparent from
these results. For these data the two methods produce similar endpoints.
This is a result of the relatively large n’s. As noted earlier, the
logit CI is asymmetric and somewhat wider, while the traditional
confidence interval is symmetric.
Table 3
Goodness-of-fit test results for hand geometry FAR’s from data found in [19]
Threshold      π̂   p-value
       80  0.1136   0.0017
       70  0.0637   0.1292
       60  0.0272   0.6945
       50  0.0098   0.9998
       40  0.0016   0.9972
       30  0.0008   0.9996
4 Assessing Performance
To test the small sample performance of these CI’s we simulate data from
a variety of different scenarios. Simulations were run because they give the
best gauge of performance for statistical methodology under a wide variety
of parameter combinations. We will refer to each parameter combination as a
scenario. Assuming that the simulated data is similar in structure to observed
data, we get a much better understanding of performance from simulation than
from looking at a single observed set of data. Below we argue that the distribution that we use for simulation is an excellent fit to data from [19]. Further
details on the Monte Carlo approach to evaluating statistical methodology can
be found in [20]. Under each scenario, 1000 data sets were generated and from
each data set a nominally 95% CI was calculated. The percentage of times that
π is captured inside these intervals is recorded and referred to as the empirical
coverage probability or, simply, the coverage. For a 95% CI, we should expect
the coverage to be 95%. However, this is not always the case especially for
small sample sizes. Below we consider a full factorial simulation study using
the following values: n = (1000, 2000), m = (5, 10), π = (0.005, 0.01, 0.05, 0.1),
and ρ = (0.1, 0.2, 0.4, 0.8). For simplicity we let mi = m, i = 1, . . . , n for these
simulations. These values were chosen to determine their impact on the coverage of the confidence intervals. Specifically, these values of π were chosen to be
representative of possible values for a BA, while the values of ρ were
chosen to represent a larger range than would be expected. Performance for
both of the methods given in this paper is exemplary when 0.1 < π < 0.9.
Because of the symmetry of binary estimation, it is sufficient to consider only
values of π less than 0.1.
Table 4
Goodness-of-fit test results for fingerprint FAR’s from data found in [19]
Threshold      π̂   p-value
       10  0.0930   0.8191
       20  0.0589   0.9761
       30  0.0292   0.9276
       40  0.0114   0.3726
       50  0.0074   0.7563
       60  0.0042   0.9541
       70  0.0032   0.9781
       80  0.0016   0.9972
       90  0.0004   0.9987
4.1 Goodness-of-Fit Tests
Because it is easy to generate data from a Beta-binomial, we would like to utilize this distribution for our simulations. To determine whether or not the
Beta-binomial distribution is appropriate for generating data, we considered
biometric decision data from [19] and computed “goodness-of-fit” test
statistics and p-values, which are discussed by several authors, e.g. [21].
The idea of a “goodness-of-fit” test is that we fit the distribution to be tested
and determine if the observed data are significantly different from this structure. Summaries based on the data are compared to summaries based on the
null distributional form. In the case of the Beta-binomial distribution, we
compare the expected counts if the data perfectly followed a Beta-binomial
distribution to the observed counts. [21] gives an excellent introduction to
these tests. Tables 3, 4 and 5 summarize the results of these tests for FAR’s
across the three modalities. Tables 6, 7 and 8 repeat that analysis for FRR’s.
Note that for hand match scores we accept below the given threshold, while
for finger and face match scores we accept above the given threshold. For all
of these tables, small p-values indicate lack of fit and that the null hypothesis that the Beta-binomial distribution fits the data should be rejected. A
more general “goodness-of-fit” test is given by [22] when the value of mi varies
across comparison pairs.
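The comparison of observed and expected counts described above can be sketched in code. The example below is only an illustration of the general idea: it assumes a common m, uses a Pearson-type statistic, and parameterizes the Beta-binomial by π and ρ (with Beta parameters a = π(1 − ρ)/ρ and b = (1 − π)(1 − ρ)/ρ). The exact test procedure of [21] may differ, no p-value is computed, and the observed counts are hypothetical.

```python
import math
from typing import List


def beta_binomial_pmf(k: int, m: int, pi: float, rho: float) -> float:
    """P(X = k) for a Beta-binomial with mean m*pi and correlation rho (0 < rho < 1)."""
    a = pi * (1 - rho) / rho
    b = (1 - pi) * (1 - rho) / rho
    log_p = (
        math.log(math.comb(m, k))
        + math.lgamma(k + a) + math.lgamma(m - k + b) - math.lgamma(m + a + b)
        - (math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))
    )
    return math.exp(log_p)


def pearson_statistic(observed: List[int], pi: float, rho: float) -> float:
    """Sum of (observed - expected)^2 / expected over the counts 0..m.

    observed[k] is the number of comparison pairs with exactly k errors.
    In practice cells with small expected counts are usually pooled first.
    """
    m = len(observed) - 1
    n = sum(observed)
    stat = 0.0
    for k, obs in enumerate(observed):
        expected = n * beta_binomial_pmf(k, m, pi, rho)
        stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts for n = 50 comparison pairs with m = 10 decisions each.
observed = [35, 8, 4, 2, 1, 0, 0, 0, 0, 0, 0]
total_prob = sum(beta_binomial_pmf(k, 10, 0.05, 0.2) for k in range(11))
mean_count = sum(k * beta_binomial_pmf(k, 10, 0.05, 0.2) for k in range(11))
stat = pearson_statistic(observed, pi=0.05, rho=0.2)
print(total_prob, mean_count, stat)
```

The two sanity checks (probabilities summing to one and a mean of mπ) confirm that this parameterization matches the first moment in (1); turning the statistic into a p-value would additionally require a chi-square reference distribution with appropriately chosen degrees of freedom.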
Looking at Tables 3 to 5 as well as Tables 6 to 8, we can readily see that
the Beta-binomial fits both FAR and FRR quite well for all three modalities.
Table 5
Goodness-of-fit test results for facial recognition FAR’s from data found in [19]
Threshold      π̂    p-value
       60  0.0876  < 0.0001
       50  0.0446    0.3124
       45  0.0291    0.9539
       40  0.0182    0.9948
       35  0.0446    0.5546
       30  0.0047    0.9323
       25  0.0024    0.9908
Table 6
Goodness-of-fit test results for hand geometry FRR’s from data found in [19]
Threshold      π̂   p-value
      100  0.1120   0.7803
      120  0.0520   0.7813
      140  0.0280   0.9918
      160  0.0180   0.9986
      180  0.0120   0.9950
      200  0.0100   0.9905
      220  0.0080   0.9795
      240  0.0060   0.9255
Only two of the fifty-one thresholds considered resulted in rejection of the
Beta-binomial distribution. This is approximately what we
would expect by chance alone using a significance level of 5%. For this analysis
we reported on a subset of thresholds that produced FAR’s and FRR’s near
or below 0.1. This choice was made because it is unlikely that a BA would be
implemented with error rates above that cutoff. It is important to note that
we are not arguing here that binary decision data, the Xi’s, from a biometric
experiment will always follow a Beta-binomial distribution. Nor are we stating
that Beta-binomial data is necessary for the use of these CI’s. (As mentioned
above, we are only specifying the first two moments of the distribution of the
Xi’s instead of specifying a particular shape for the distribution.) Rather, what
we conclude from the above results in Tables 3 through 8 is that the Beta-binomial is a reasonable distribution for simulation of small sample decision
data since it fit data from three different modalities well. Thus we generate
Xi’s from a Beta-binomial distribution to test the performance of the CI
methods specified above.
4.2 Simulation Results
Before presenting the simulation results, it is necessary to summarize our
goals. First, we want to determine for what combination of parameters, the
methodology achieves coverage close to the nominal level, in this case, 95%.
Second, because we are dealing with simulations, we should focus on overall
trends rather than on specific outcomes. If we repeated these same simulations
again, we would see slight changes in the coverages of individual scenarios but
the overall trends should remain. Third, we would like to be able to categorize which parameter combinations give appropriate performance. We use the
Monte Carlo approach because it gives a more complete picture than would be found in
the evaluation of a “real” data set. See, e.g. [20] for a complete discussion.
Evaluations from observed test data give a less complete assessment of how
well an estimation method performs since there is no way to consider
all the possible parameter combinations from such data.
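A single scenario of the Monte Carlo study can be sketched as below. This is not the simulation code used for the tables: it assumes a common m, uses the Beta parameters a = π(1 − ρ)/ρ and b = (1 − π)(1 − ρ)/ρ so that the simulated counts have the moments in (1), counts a data set with no errors as a miss, and runs a smaller scenario (n = 200, 200 replications) than those reported so that it finishes quickly. All function names and the seed are hypothetical.

```python
import math
import random
from statistics import NormalDist
from typing import List, Tuple


def simulate_counts(n: int, m: int, pi: float, rho: float, rng: random.Random) -> List[int]:
    """Draw n Beta-binomial error counts, each out of m decisions."""
    a = pi * (1 - rho) / rho
    b = (1 - pi) * (1 - rho) / rho
    counts = []
    for _ in range(n):
        p_i = rng.betavariate(a, b)  # error rate for this comparison pair
        counts.append(sum(1 for _ in range(m) if rng.random() < p_i))
    return counts


def traditional_interval(x: List[int], m: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Interval (5) with the moment estimates (2); requires m >= 2."""
    n = len(x)
    pi_hat = sum(x) / (n * m)
    if pi_hat in (0.0, 1.0):
        return pi_hat, pi_hat  # degenerate interval, treated as a miss below
    num = sum(xi * (xi - 1) - 2 * pi_hat * (m - 1) * xi + m * (m - 1) * pi_hat ** 2 for xi in x)
    rho_hat = num / (n * m * (m - 1) * pi_hat * (1 - pi_hat))
    var = pi_hat * (1 - pi_hat) * (1 + (m - 1) * rho_hat) / (n * m)
    # Guard against a negative variance estimate when rho_hat is very negative.
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(max(var, 0.0))
    return pi_hat - half_width, pi_hat + half_width


def coverage(n: int, m: int, pi: float, rho: float, reps: int = 200, seed: int = 2005) -> float:
    """Fraction of simulated data sets whose interval captures the true pi."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = traditional_interval(simulate_counts(n, m, pi, rho, rng), m)
        hits += lo <= pi <= hi
    return hits / reps


cov = coverage(n=200, m=5, pi=0.1, rho=0.1)
print(f"empirical coverage: {cov:.3f}")
```

Because the result depends on the random seed and on the number of replications, individual values will vary from run to run, which is the reason for focusing on overall trends rather than on specific outcomes.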
4.2.1 Traditional confidence interval performance
Using (5) we calculated coverage for each scenario. The results of this simulation can be found in Table 9. Several clear patterns emerge. Coverage increases
as π increases, as ρ decreases, as n increases and as m increases. This is exactly as we would have expected. More observations should increase our ability
to accurately estimate π. Similarly the assumption of approximate Normality
will be most appropriate when π is moderate (far from zero and far from one)
and when ρ is small. This CI performs well except when π < 0.01 and ρ ≥ 0.2.
Table 7
Goodness-of-fit test results for fingerprint FRR’s from data found in [19]
Threshold      π̂   p-value
       50  0.1140   0.4134
       40  0.0980   0.5160
       30  0.0880   0.1554
       20  0.0760   0.5121
       10  0.0589   0.9761
        5  0.0120   0.9950
        1  0.0020   0.9999
Table 8
Goodness-of-fit test results for facial recognition FRR’s from data found in [19]
Threshold      π̂   p-value
       45  0.1060   0.2614
       50  0.0660   0.9509
       55  0.0540   0.5353
       60  0.0500   0.5885
       65  0.0300   0.9216
       70  0.0180   0.9067
       75  0.0140   0.9067
       80  0.0060   0.9985
       85  0.0040   0.9996
       90  0.0040   0.9996
       95  0.0040   0.9996
      100  0.0040   0.9996
      105  0.0020   1.0000
There is quite a range of coverages from a high of 0.959 to a low of 0.896 with
a mean coverage of 0.940. One way to think about ρ is that it governs how
much ‘independent’ - in a statistical sense - information can be found in the
data. Higher values of ρ indicate that there is less ‘independent’ information
in the data. This performance is not surprising since binary data is difficult to
assess when there is a high degree of correlation within a comparison. One reasonable rule of thumb is that the CI performs well when n†π ≥ 10, where
$$
n^{\dagger} = \frac{nm}{1 + (m_0 - 1)\rho}
\tag{8}
$$
is referred to as the effective sample size in the statistics literature [23].
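The rule of thumb based on (8) is simple to check for a planned scenario. The sketch below is illustrative only; the function name is hypothetical, and it assumes m_0 = m (all comparison pairs tested the same number of times) unless `m0` is supplied.

```python
from typing import Optional


def effective_sample_size(n: int, m: float, rho: float, m0: Optional[float] = None) -> float:
    """Effective sample size from (8)."""
    if m0 is None:
        m0 = m  # equal numbers of decisions per comparison pair
    return n * m / (1 + (m0 - 1) * rho)


# One of the simulated scenarios: n = 1000, m = 5, rho = 0.4, pi = 0.005.
n_eff = effective_sample_size(n=1000, m=5, rho=0.4)
meets_rule = n_eff * 0.005 >= 10
print(round(n_eff, 1), meets_rule)
```

In this scenario the 5000 decisions carry roughly the information of 1923 independent ones, so the product with π falls just short of 10 and the rule of thumb for the traditional interval is not met.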
4.2.2 Logit confidence interval performance
To assess how well this second interval estimates π we repeated the simulation
using (6) to create our intervals. Output from these simulations is summarized
in Table 10. Again coverage should be approximately 95% for a nominally 95%
CI. Looking at the results found in Table 10, we note that there are very similar
patterns to those found in the previous section. However, it is clear that the
coverage here is generally higher than for the traditional interval. As before
our interest is in overall trends. In general, coverage increases as π increases,
as ρ decreases, as m increases and as n increases. Coverages range from a high
of 0.969 to a low of 0.930 with a mean coverage of 0.949. Only one of the
coverages, that for n = 1000, m = 5, π = 0.005 and ρ = 0.4, is of concern here.
That value seems anomalous when compared to the coverage obtained when
n = 1000, m = 5, π = 0.005 and ρ = 0.8. Otherwise the CI based on a logit
transformation performed extremely well. Overall, coverage for the logit CI
is higher than for the traditional confidence interval. It performs well when
n† π ≥ 5. Thus, use of this CI is appropriate when the number of comparison
pairs is roughly half what would be needed for the traditional CI.
5 Sample size calculations
As discussed earlier and highlighted by [3], there is a pressing need for appropriate sample size calculations for testing of BA’s. Here we present sample size
calculations using the logit transformed interval since it gives better coverage
and requires fewer observations for its usage than the traditional approach.
(It is straightforward to solve (5) as we do below for (6) to achieve a specified
margin of error.) Because of the way that matching performance for BA’s is
assessed, there are effectively two sample sizes for a biometric test: n and m.
The calculations given below solve for n, the number of comparison pairs,
conditional on knowledge of m, the number of decisions per comparison pair.
Table 9
Empirical Coverage Probabilities for Traditional Confidence Interval
n = 1000, m = 5
π\ρ      0.1    0.2    0.4    0.8
0.005  0.935  0.926  0.925  0.907
0.010  0.935  0.935  0.929  0.928
0.050  0.952  0.945  0.954  0.947
0.100  0.943  0.957  0.949  0.953
n = 1000, m = 10
π\ρ      0.1    0.2    0.4    0.8
0.005  0.931  0.926  0.922  0.896
0.010  0.941  0.934  0.924  0.924
0.050  0.945  0.949  0.948  0.945
0.100  0.949  0.944  0.947  0.950
n = 2000, m = 5
π\ρ      0.1    0.2    0.4    0.8
0.005  0.951  0.928  0.941  0.927
0.010  0.947  0.958  0.934  0.932
0.050  0.941  0.959  0.947  0.954
0.100  0.946  0.944  0.951  0.938
n = 2000, m = 10
π\ρ      0.1    0.2    0.4    0.8
0.005  0.941  0.927  0.930  0.919
0.010  0.945  0.941  0.941  0.940
0.050  0.942  0.941  0.941  0.940
0.100  0.949  0.950  0.953  0.950
Each cell represents the coverage based on 1000 simulated data sets.
Table 10
Empirical Coverage Probabilities for Logit Confidence Interval
n = 1000, m = 5
π\ρ      0.1    0.2    0.4    0.8
0.005  0.949  0.940  0.930  0.952
0.010  0.946  0.946  0.935  0.960
0.050  0.969  0.949  0.935  0.960
0.100  0.938  0.946  0.946  0.948
n = 1000, m = 10
π\ρ      0.1    0.2    0.4    0.8
0.005  0.952  0.937  0.941  0.952
0.010  0.948  0.945  0.943  0.964
0.050  0.952  0.947  0.952  0.959
0.100  0.945  0.944  0.950  0.952
n = 2000, m = 5
π\ρ      0.1    0.2    0.4    0.8
0.005  0.952  0.960  0.944  0.951
0.010  0.965  0.939  0.951  0.953
0.050  0.953  0.956  0.951  0.945
0.100  0.940  0.965  0.950  0.952
n = 2000, m = 10
π\ρ      0.1    0.2    0.4    0.8
0.005  0.947  0.937  0.954  0.956
0.010  0.960  0.944  0.945  0.943
0.050  0.946  0.951  0.939  0.947
0.100  0.954  0.957  0.944  0.950
Each cell represents the coverage based on 1000 simulated data sets.
Our sample size calculations require the specification of a priori estimates of
π and ρ. This is typical of any sample size calculation. In the next section
we discuss suggestions for selecting values of π and ρ as part of a sample size
calculation. The asymmetry of the logit interval provides a challenge relative
to the typical sample size calculation. Thus rather than specifying the margin
of error as is typical, we will specify the desired upper bound for the CI, call
it πmax. Given the nature of BA’s and their usage, it seems somewhat natural
to specify the highest acceptable value for the range of the interval. We then
set the upper endpoint of (6) equal to logit(πmax) and solve for n.
Given (6), we can determine the appropriate sample size needed to estimate
π with a certain level of confidence, 1 − α, and a specified upper bound,
πmax. Since it is not possible to simultaneously solve for m and n, we propose
a conditional solution. First, specify appropriate values for π, ρ, πmax, and
1 − α. Second, fix m, the number of attempts per comparison. We assume for
sample size calculations that mi = m for all i. (If significant variability in the
mi’s is anticipated then we recommend using a value of m that is slightly less
than the anticipated average of the mi’s.) Third, solve for n, the number of
comparisons to be tested. We then find n via the following equation, given the
other quantities,
$$
n = \left\lceil \left( \frac{z_{1-\alpha/2}}{\mathrm{logit}(\pi_{max}) - \mathrm{logit}(\pi)} \right)^{2} \frac{1 + (m - 1)\rho}{m \pi (1 - \pi)} \right\rceil.
\tag{9}
$$
The above follows directly from (6). To illustrate this suppose we want to
estimate π to an upper bound of πmax = 0.01 with 99% confidence and we
believe π to be 0.005 and ρ to be 0.2. If we plan on testing each comparison
pair 5 times we would need
$$
n = \left\lceil \left( \frac{2.576}{\mathrm{logit}(0.01) - \mathrm{logit}(0.005)} \right)^{2} \frac{1 + (5 - 1)0.2}{5(0.005)(0.995)} \right\rceil = \lceil 984.92 \rceil = 985.
\tag{10}
$$
So we would need to test 985 comparison pairs 5 times each to achieve a 99% CI
with an upper bound of 0.01. If asymmetric cross-comparisons are to be used
among multiple individuals, then one could replace n on the left-hand side of
(9) with n∗(n∗ − 1) and solve for n∗. In the example above, n∗ = 32 would be
the required number of individuals. In the case of symmetric cross comparisons,
one would solve n∗(n∗ − 1)/2 = 985, which yields n∗ = 45 individuals assuming
the conditions specified above. Table 11 contains additional values of n for
given values of m. In addition this table contains mn, the total number of
“decisions” that would be needed to achieve the specified upper bound for this
CI. Clearly the relationship between mn and n is non-linear. This concurs with
the observation of [2] when they discuss the “non-stationarity” of collecting
biometric data.
Table 11
n necessary to create a 99% confidence interval with π = 0.005, πmax = 0.01,
ρ = 0.2 for various values of m.
 m     n     mn
 2  1642   3284
 5   985   4925
 8   821   6568
10   767   7670
12   730   8760
15   694  10410
20   657  13140
30   621  18630
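The sample size formula (9) and the cross-comparison adjustment described above can be sketched as follows. This is an illustrative example, not the author's code; the function names are hypothetical, and the Gaussian percentile is computed to full precision rather than rounded to 2.576.

```python
import math
from statistics import NormalDist


def logit(p: float) -> float:
    return math.log(p / (1 - p))


def required_pairs(pi: float, rho: float, pi_max: float, m: int, alpha: float = 0.01) -> int:
    """Number of comparison pairs n from (9), given m decisions per pair."""
    if not 0 < pi < pi_max < 1:
        raise ValueError("need 0 < pi < pi_max < 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)
    ratio = z / (logit(pi_max) - logit(pi))
    return math.ceil(ratio ** 2 * (1 + (m - 1) * rho) / (m * pi * (1 - pi)))


def individuals_needed(n_pairs: int, symmetric: bool) -> int:
    """Smallest n* with n*(n* - 1) >= n_pairs, or n*(n* - 1)/2 >= n_pairs if symmetric."""
    divisor = 2 if symmetric else 1
    n_star = 2
    while n_star * (n_star - 1) // divisor < n_pairs:
        n_star += 1
    return n_star


# Scenario of the worked example: pi = 0.005, rho = 0.2, pi_max = 0.01, 99% confidence.
n_m5 = required_pairs(pi=0.005, rho=0.2, pi_max=0.01, m=5)
for m in (2, 5, 8, 10, 12, 15, 20, 30):
    n = required_pairs(pi=0.005, rho=0.2, pi_max=0.01, m=m)
    print(m, n, m * n)
print(individuals_needed(n_m5, symmetric=False), individuals_needed(n_m5, symmetric=True))
```

Looping over m in this way should reproduce the n and mn columns of Table 11 and makes the trade-off visible: increasing m lowers the number of comparison pairs but raises the total number of decisions.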
6 Discussion
The recent Biometric Research Agenda stated clearly that one of the fundamental needs for research on BA’s was the development of “statistical understanding of biometric systems sufficient to produce models useful for performance evaluation and prediction,” [3, p. 3]. The methodologies discussed
in this paper are a significant step toward that. This paper adds two significant tools for testers of biometric identification devices: well-understood CI
methodology and a formula for determining the number of individuals to be
tested. These are significant advances to core issues in the evaluation, assessment and development of biometric authentication devices. Below we discuss
the properties of these methods and outline some future directions for research
in this area.
The models we have developed are based on the following widely applicable
assumptions. First we assume that the moments of the Xi ’s are given by (1).
Second we assume that attempts made by each comparison are conditionally
independent given the model parameters. We reiterate that an analysis of
data found in [19] suggests that these are reasonable assumptions. For any
BA, its matching performance is often critical to the overall performance of
the system in which it is imbedded. In this paper we have presented two new
methodologies for creating a CI for an error rate. The logit transformed CI,
(6), had superior performance to the traditional CI. This methodology did
well when n† π > 5. Though this study presented results only for 95% CI’s, it
is reasonable to assume performance will be similar for other confidence levels.
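
As an illustration of the logit-transformed CI and of the n†π > 5 guideline, the following is a minimal sketch. It assumes that the effective sample size is n† = mn / (1 + (m − 1)ρ) and that the standard error of logit(π̂) is 1 / sqrt(n† π̂(1 − π̂)), which is the usual delta-method form. Both expressions are stated here as assumptions, since equation (6) is not repeated in this section, and the function name logit_ci is hypothetical.

```python
import math
from statistics import NormalDist


def logit_ci(pi_hat, rho_hat, n, m, conf=0.95):
    """Logit-transformed CI for an error rate, plus the effective sample size.

    Assumptions (illustrative, not quoted from the paper):
      n_eff = m * n / (1 + (m - 1) * rho_hat)
      se(logit(pi_hat)) = 1 / sqrt(n_eff * pi_hat * (1 - pi_hat))
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    n_eff = m * n / (1.0 + (m - 1) * rho_hat)
    centre = math.log(pi_hat / (1.0 - pi_hat))          # logit of the estimate
    half = z / math.sqrt(n_eff * pi_hat * (1.0 - pi_hat))

    def expit(x):
        return 1.0 / (1.0 + math.exp(-x))               # inverse of the logit

    return expit(centre - half), expit(centre + half), n_eff


# Example values chosen to match the m = 5 setting of Table 11 (99% level).
lower, upper, n_eff = logit_ci(pi_hat=0.005, rho_hat=0.2, n=985, m=5, conf=0.99)
print(f"CI = ({lower:.4f}, {upper:.4f}), n_eff * pi_hat = {n_eff * 0.005:.1f}")
```

With these inputs the interval is asymmetric about π̂ = 0.005, its upper limit falls just below 0.01, and n†π̂ is roughly 13.7, comfortably above 5.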
Further, we have presented methodology for determining the number of attempts needed for making a CI. This is an immediate consequence of using
a parametric CI. Because of the asymmetry of this CI, it is necessary to specify
the upper bound for the CI as well as specifying m, π and ρ. All sample size
calculations carried out before data is collected require estimates of parameters. To choose estimates we suggest the following possibilities in order of
importance.
(1) Use estimates for π and ρ from previous studies collected under similar
circumstances.
(2) Conduct a pilot study with some small number of comparisons and a
value of m that will likely be used in the full experiment. That will allow
for reasonable estimates of π and ρ.
(3) Make a reasonable estimate based on knowledge of the BA and the environment in which it will be tested. One strategy here is to overestimate
π and ρ, which will generally yield n larger than is needed.
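
The effect of overestimating π or ρ, as suggested in item (3), can be sketched as follows. The function again assumes that n comes from inverting the upper limit of the logit-based CI; this form and the name individuals_needed are illustrative assumptions, and the inflated values π = 0.006 and ρ = 0.3 are hypothetical planning guesses, not values from the paper.

```python
import math
from statistics import NormalDist


def individuals_needed(pi, pi_max, rho, m, conf=0.99):
    # Assumed inversion of the upper logit CI limit (illustrative only).
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf) / 2.0)
    width = math.log(pi_max / (1 - pi_max)) - math.log(pi / (1 - pi))
    return math.ceil((z / width) ** 2 * (1 + (m - 1) * rho) / (m * pi * (1 - pi)))


baseline = individuals_needed(pi=0.005, pi_max=0.01, rho=0.2, m=5)
higher_pi = individuals_needed(pi=0.006, pi_max=0.01, rho=0.2, m=5)   # pi overestimated
higher_rho = individuals_needed(pi=0.005, pi_max=0.01, rho=0.3, m=5)  # rho overestimated
print(baseline, higher_pi, higher_rho)
```

In this example both inflated guesses lead to a larger n than the baseline, so the resulting test plan errs on the conservative side.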
As outlined above, this now gives BA testers an important tool for determining
the number of comparisons and the number of decisions per comparison pair
necessary for assessing a single FAR or FRR.
References
[1] R. M. Bolle, N. K. Ratha, S. Pankanti, Error analysis of pattern recognition
systems – the subsets bootstrap, Computer Vision and Image Understanding
93 (2004) 1–33.
[2] T. Mansfield, J. L. Wayman, Best practices in testing and reporting
performance of biometric devices, on the web at
www.cesg.gov.uk/site/ast/biometrics/media/BestPractice.pdf (2002).
[3] E. P. Rood, A. K. Jain, Biometric research agenda, Report of the NSF Workshop
(2003).
[4] W. Shen, M. Surette, R. Khanna, Evaluation of automated biometrics-based
identification and verification systems, Proceedings of the IEEE 85 (9) (1997)
1464–1478.
[5] M. Golfarelli, D. Maio, D. Maltoni, On the error-reject trade-off in biometric
verification systems, IEEE Transactions on Pattern Analysis and Machine
Intelligence 19 (7) (1997) 786–796.
[6] I. Guyon, J. Makhoul, R. Schwartz, V. Vapnik, What size test set gives good
error rate estimates, IEEE Transactions on Pattern Analysis and Machine
Intelligence 20 (1) (1998) 52–64.
[7] M. E. Schuckers, Using the beta-binomial distribution to assess performance of
a biometric identification device, International Journal of Image and Graphics
3 (3) (2003) 523–529.
[8] J. L. Wayman, Confidence interval and test size estimation for biometric data,
in: Proceedings of IEEE AutoID ’99, 1999, pp. 177–184.
[9] R. J. Michaels, T. E. Boult, Efficient evaluation of classification and recognition
systems, in: Proceedings of the International Conference on Computer Vision
and Pattern Recognition, 2001.
[10] G. R. Doddington, M. A. Przybocki, A. F. Martin, D. A. Reynolds, The
NIST speaker recognition evaluation: overview, methodology, systems, results,
perspective, Speech Communication 31 (2-3) (2000) 225–254.
[11] D. F. Moore, Modeling the extraneous variance in the presence of extra-binomial
variation, Applied Statistics 36 (1) (1987) 8–14.
[12] G. W. Snedecor, W. G. Cochran, Statistical Methods, 8th Edition, Iowa State
University Press, 1995.
[13] W. G. Cochran, Sampling Techniques, 3rd Edition, John Wiley & Sons, New
York, 1977.
[14] J. L. Fleiss, B. Levin, M. C. Paik, Statistical Methods for Rates and Proportions,
John Wiley & Sons, Inc., 2003.
[15] D. F. Moore, Asymptotic properties of moment estimators for overdispersed
counts and proportions, Biometrika 73 (3) (1986) 583–588.
[16] A. Agresti, Categorical Data Analysis, John Wiley & Sons, New York, 1990.
[17] R. G. Newcombe, Logit confidence intervals and the inverse sinh transformation,
The American Statistician 55 (3) (2001) 200–202.
[18] M. J. Schervish, Theory of Statistics, Springer-Verlag, New York, 1995.
[19] A. Ross, A. K. Jain, Information fusion in biometrics, Pattern Recognition
Letters 24 (13) (2003) 2115–2125.
[20] J. E. Gentle, Random Number Generation and Monte Carlo Methods,
Springer-Verlag, 2003.
[21] D. D. Wackerly, W. Mendenhall III, R. L. Scheaffer, Mathematical Statistics with
Applications, 6th Edition, Duxbury, 2002.
[22] S. T. Garren, R. L. Smith, W. W. Piegorsch, Bootstrap goodness-of-fit test for
the beta-binomial model, Journal of Applied Statistics 28 (5) (2001) 561–571.
[23] L. Kish, Survey Sampling, John Wiley & Sons, New York, 1965.