PSYCHOMETRIKA--VOL. 68, NO. 3, 453-471
SEPTEMBER 2003
USING THE CONDITIONAL GRADE-OF-MEMBERSHIP MODEL
TO ASSESS JUDGMENT ACCURACY
BRUCE COOIL
OWEN GRADUATE SCHOOL OF MANAGEMENT
VANDERBILT UNIVERSITY
SAJEEV VARKI
COLLEGE OF BUSINESS ADMINISTRATION
UNIVERSITY OF RHODE ISLAND
Consider the case where J instruments are used to classify each of I objects relative to K nominal categories. The conditional grade-of-membership (GoM) model provides a method of estimating the
classification probabilities of each instrument (or "judge") when the objects being classified consist of
both pure types that lie exclusively in one of K nominal categories, and mixtures that lie in more than
one category. Classification probabilities are identifiable whenever the sample of GoM vectors includes
pure types from each category. When additional, relatively mild, assumptions are made about judgment
accuracy, the identifiable correct classification probabilities are the greatest lower bounds among all solutions that might correspond to the observed multinomial process, even when the unobserved GoM vectors
do not include pure types from each category. Estimation using the conditional GoM model is illustrated
on a simulated data set. Further simulations show that the estimates of the classification probabilities are
relatively accurate, even when the sample contains only a small percentage of approximately pure objects.
Key words: nominal classification, incidental parameters, extreme profiles, mixtures.

The authors thank Max A. Woodbury, Kenneth G. Manton, and H. Dennis Tolley for their help, and four anonymous Psychometrika reviewers (including an associate editor) for their beneficial expository and technical suggestions. This work was supported by the Dean's Fund for Summer Research, Owen Graduate School of Management, Vanderbilt University. Requests for reprints should be sent to Bruce Cooil, OGSM, Vanderbilt University, 401 21st Avenue South, Nashville, TN 37203. E-Mail: [email protected]

© 2003 The Psychometric Society
1. Introduction
Imagine any general setting where J instruments are used to classify each of I objects relative to K nominal categories. The instruments themselves may be doctors making diagnoses,
coders classifying open-ended survey responses into nominal categories, questions on a psychological test or survey where a subject selects one of several preselected responses, or any other
instrument used to classify objects into categories. For simplicity we will refer to these instruments as judges. In these cases, the objects would be the patients seeking a diagnosis, individual
responses to open-ended survey questions that are being coded into categories, or individuals
taking a test or survey, respectively. Judgment-based classification has been used widely in psychology to study personality, vocational interests, psychiatric diagnoses, and even interpersonal
interactions (e.g., Chavez & Buriel, 1988), but the applications also extend across the domain of
social sciences. For example, it has been used in education to study teaching styles (e.g., Tsai &
Denton, 1993) and in marketing to study the content of advertisements (e.g., Yale & Gilly, 1988).
This classification framework has also been used to develop general models and estimators
for the reliability and structure of qualitative data (Batchelder & Romney, 1988; Cohen, 1960,
1968; Cooil & Rust, 1995; Dillon & Mulani, 1984; Klauer & Batchelder, 1996; Perreault &
Leigh, 1989). General latent class models allow one to use the data provided by the J judges to
determine the probabilities with which they make correct and incorrect classifications, as well
as the prior and posterior probabilities that a given object belongs to one of the K categories
(Batchelder & Romney, 1988, 1989; Dillon & Mulani, 1984). Manton, Woodbury, and Tolley
(1994) have developed a general grade-of-membership (GoM) model that extends the latent class
framework to include the identification of objects as mixtures of an arbitrary number of latent
nominal categories. This general GoM model has been used in a wide range of empirical applications to determine latent mixtures from high dimensional discrete multivariate data (e.g.,
Berkman, Singer, & Manton, 1989; Blazer et al., 1989; Vertrees & Manton, 1986; Woodbury &
Manton, 1982).
We consider an adaptation of the general GoM model to the classification framework. As in
the more general GoM framework, each object is potentially a mixture of the K categories (or
extreme profiles), and the degree to which object i belongs to the k-th category is denoted by the
grade-of-membership score g_{ik}, which varies between 0 and 1, where

$$\sum_{k=1}^{K} g_{ik} = 1, \qquad 0 \le g_{ik} \le 1, \qquad i = 1, \ldots, I. \tag{1}$$
When gik = 1, object i has exclusive (or "crisp") membership in category k and is referred to as
a "pure" type, while gik = 0 indicates that object i has no membership in category k. Fractional
grade-of-membership values provide a representation of each object's heterogeneity that is not
possible in many classification models. In contrast to the more general GoM framework, we
assume the K categories are known beforehand (not latent) and the judges, who do not directly
observe the gik values, classify each object to one of the K categories. The only data available,
from the classification process, are the actual classifications of each object by each judge.
Consider a specific example where a psychological test is administered to determine a person's vocational interests (e.g., Holland's six vocational dimensions, Holland, 1985), and each
test item provides alternative responses that relate directly to the K manifest categories. In this
case g_{ik}, 1 ≤ k ≤ K, represents respondent i's degree of interest on vocational dimension k,
and the GoM model studied here provides estimates of the probabilities with which each test
item (judge) correctly or incorrectly classifies each basic type (i.e., an individual with a given
"pure" vocational interest, at the extreme of one of' the possible dimensions) into each one of the
possible basic types. These probability estimates directly measure test item reliability and also
provide a way of determining the overall reliability of any group of test items, or even of the test
itself (Cooil & Rust, 1995).
We present sufficient conditions under which the GoM model provides an identifiable matrix of the probabilities with which judges make classifications. Given that they are identifiable,
Tolley & Manton (1992) have shown that the estimates taken from the conditional version of
this model are consistent and asymptotically normal. In this nominal classification framework,
each judge makes an explicit classification to one of the K categories, and only the grade-of-membership values (g_{ik}) in (1), which designate the actual origin of each object, remain unobserved. Manton et al. (1994) consider this issue in the more general framework where data
are available in the form of closed-ended responses to survey questions. In that case the objects
are assumed to belong to latent categories, and the classification probabilities are not uniquely
determined by the observed responses (Manton et al., 1994, pp. 24-28, 53-66), but can be further constrained so that they are identifiable (Manton et al., 1994, pp. 69-70). We show that by
considering the simpler but still quite general classification problem, where expert judgements
are actually available on how objects should be classified to K prespecified nominal categories,
the estimands are unique, under relatively mild conditions, as long as the sample includes the
classification of objects that lie exclusively in each of the nominal categories. These results also
suggest important ways of quantifying the various aspects of judgement accuracy and provide a
guide as to what is ultimately possible with more general forms of the GoM model. We focus
on the conditional form of the GoM model because it provides a practical way to find estimates
of classification probabilities, especially when J is considerably larger than K. Therefore, the
conditional model extends the range of practical applications to settings where it is necessary or
natural to consider a large number of judges. Such applications include: (a) screening a larger
number of judges to determine a subgroup of true experts, and (b) cases where a large number of
categories and judges are necessary to accommodate the complexity of the objects being studied.
Specific examples include the use of psychological test items, test-takers, patients, or customers
to evaluate individual psychological profiles, test procedures, patient care, or the quality of products or services, respectively.
The conditional GoM likelihood is a multinomial distribution, conditional on fixed GoM vectors (g_{i1}, ..., g_{iK}), i = 1, ..., I, where the parameters of primary interest are the unknown classification probabilities. Following Manton et al. (1994), the classification probabilities are estimated directly from this likelihood along with the GoM vectors. The estimates of these vectors are consistent only in the sense that their empirical J-th order moments converge to those of the g_{ik}-distribution, F_g, from which the row vectors (g_{i1}, ..., g_{iK}), i = 1, ..., I, are drawn. Nevertheless, the estimated classification probabilities are unconditionally consistent (Tolley & Manton, 1992, Theorem 4.1, p. 92), and the joint asymptotic distribution of the estimated classification probabilities and moments of F_g allows approximate chi-square tests based on likelihood ratio statistics (Manton et al., 1994, pp. 75-82).
In section 2, we discuss how the G o M model can be applied to the nominal classification problem, present sufficient conditions for identifiable classification probabilities, and study
additional conditions that ensure that the identifiable correct classification probabilities are the
greatest lower bounds among all solutions that might correspond to the observed multinomial
process. Proofs are provided in Appendix A. The asymptotic distribution of these estimators
then follows from Tolley and Manton (1992). These results are also briefly summarized in section 2. Section 3 illustrates the application of the model on a simulated data set. A procedure for
selecting starting values and a summary of an estimation algorithm are found in Appendix B.
In section 4, we study the accuracy of the estimation method using simulated data that consists
entirely of mixtures, although some are nearly crisp. Our results and conclusions are summarized
in section 5.
2. A Model for the Classification of Mixtures
Assume that each of I objects are assigned to one of K nominal categories by each of J
judges. Although each object is potentially a mixture, each judge is asked only to classify it to
the most appropriate single category. Thus, a strength of the model is that it does not require data
that are any different from what is available when latent class models are used. Let y_{ijk} represent the indicator variable for whether or not judge j classifies object i to category k:

$$y_{ijk} = \begin{cases} 1, & \text{if judge } j \text{ classifies object } i \text{ to category } k, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
The model assumes that the y_{ijk} are observed realizations of a random variable Y_{ijk}, and that judge j, j = 1, ..., J, classifies object i, 1 ≤ i ≤ I, to category k, k = 1, ..., K, with a probability p_{ijk}, where

$$p_{ijk} = P[Y_{ijk} = 1] = \sum_{\ell=1}^{K} g_{i\ell}\,\lambda_{\ell jk}, \tag{3}$$

and λ_{ℓjk} represents the probability that judge j classifies an object that lies exclusively in category ℓ (i.e., a "pure" type) to category k. The λ_{ℓjk} are the classification probabilities and are subject to the constraints

$$\sum_{k=1}^{K} \lambda_{\ell jk} = 1, \quad \lambda_{\ell jk} \ge 0, \quad \text{for all } \ell \text{ and } j, \; \ell = 1, \ldots, K, \; j = 1, \ldots, J. \tag{4}$$
Equation (3) provides the operational definition of the GoM values (g_{ik}) in terms of their relationship to the actual classification probabilities for mixtures (p_{ijk}), given the underlying probabilities governing the classification of pure types (λ_{ℓjk}) of (4).
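As an illustration from us (not part of the original article), the following Python sketch simulates the classification process implied by (3); NumPy is assumed, and the names G, Lam, and simulate_classifications are ours:

```python
import numpy as np

def simulate_classifications(G, Lam, rng):
    """Draw one classification of each object by each judge.

    G   : (I, K) array of grade-of-membership vectors (rows sum to 1).
    Lam : (J, K, K) array; Lam[j, l, k] = lambda_{ljk}, the probability
          that judge j assigns a pure type from category l to category k.
    Returns an (I, J) array x with x[i, j] = category chosen by judge j.
    """
    I, K = G.shape
    J = Lam.shape[0]
    x = np.empty((I, J), dtype=int)
    for j in range(J):
        P_j = G @ Lam[j]          # (I, K): p_{ijk} = sum_l g_{il} lambda_{ljk}, as in (3)
        for i in range(I):
            x[i, j] = rng.choice(K, p=P_j[i])
    return x

rng = np.random.default_rng(0)
G = rng.dirichlet(np.ones(4), size=800)          # flat Dirichlet mixtures (illustrative)
Lam = np.full((8, 4, 4), 0.1) + 0.6 * np.eye(4)  # each judge 70% correct, errors spread evenly
x = simulate_classifications(G, Lam, rng)
```

Each row of G @ Lam[j] collects the p_{ijk} of (3) for judge j, so drawing a category from that row yields one observed classification, and accumulating the draws produces exactly the indicator data y_{ijk} of (2).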
If judges make independent classifications, then conditional on the gik, the multinomial
random variables Yijk are independent for different values of i (objects) and j (judges). In what
is referred to as the "unconditional likelihood function" for the GoM model, an expectation
E{·} is taken with respect to the K-dimensional distribution of the unobserved GoM vectors (g_{i1}, ..., g_{iK}), 1 ≤ i ≤ I, which has support on the (K − 1)-dimensional simplex defined in (1) (Manton et al., 1994, p. 23; Varki, Cooil, & Rust, 2000, pp. 483, 488-489):
$$L_{GoM} = E\left\{\prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\left(p_{ijk}\right)^{y_{ijk}}\right\} = E\left\{\prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\left(\sum_{\ell=1}^{K} g_{i\ell}\,\lambda_{\ell jk}\right)^{y_{ijk}}\right\} = \prod_{i=1}^{I}\left\{\sum_{k_1=1}^{K}\sum_{k_2=1}^{K}\cdots\sum_{k_J=1}^{K}\left(\prod_{j=1}^{J}\lambda_{k_j j x_{i,j}}\right) E\left\{\prod_{j=1}^{J} g_{i k_j}\right\}\right\}, \tag{5}$$
where the penultimate expression follows from (3) and, in the last expression (5), the x_{i,j} denote the actual category to which judge j assigns object i. The likelihood in (5) is an integrated, or marginal, likelihood with respect to the g_{ik}-distribution. Although the g_{ik} that correspond to a specific object i are not parameters per se, (5) is still "unconditional" in the sense that it requires the joint estimation of all J-th-order factorial moments of the g_{ik}-distribution, or (J + K − 1)!/[J!(K − 1)!] − 1 parameters, in addition to the λ_{ℓjk}, which are JK(K − 1) additional parameters (given the JK constraints of (4)). This is a formidable estimation task, even when the λ_{ℓjk} are constrained to be mathematically identifiable, because the number of factorial moments becomes very large when there are more judges (J) than categories (K) and K ≥ 4 (e.g., if K = 4, J = 8, there are 165 additional moment parameters; if K = 5, J = 10, there are 1001). Thus, even with large samples it is typically difficult to maximize (5) so that the estimates are locally identifiable (i.e., the estimated information matrix is typically singular). Manton et al. (1994, p. 71) discuss other numerical difficulties with maximizing (5). Perhaps for these reasons, we have not seen a published application that uses either (5) or the counterpart available for latent categories (Manton et al., 1994, pp. 23, 67-69). One way of reducing the parameter space is to posit a specific parametric form for the g_{ik}-distribution (e.g., Varki et al., 2000). A more general alternative approach uses the conditional likelihood, which is the multinomial distribution of the y_{ijk} with respect to the p_{ijk} probabilities in (3), conditional on the actual g_{ik} values,
$$L_{CGoM} = \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\left(p_{ijk}\right)^{y_{ijk}} \tag{6}$$

$$= \prod_{i=1}^{I}\prod_{j=1}^{J}\prod_{k=1}^{K}\left(\sum_{\ell=1}^{K} g_{i\ell}\,\lambda_{\ell jk}\right)^{y_{ijk}}, \tag{7}$$
where (7) follows from (6) and (3). This conditional likelihood allows the investigator to focus on the estimation of the classification probabilities λ_{ℓjk}, which constitute JK(K − 1) parameters, given constraint (4). The g_{ik} values are treated as missing data from an unspecified distribution. Given constraint (1), the g_{ik} have I(K − 1) degrees of freedom. Following Manton et al. (1994, p. 68), the g_{ik} are estimated in the penultimate step of repeated iterations of an algorithm designed to maximize profiles of the conditional likelihood (7), first with respect to the g_{ik} and then with respect to the λ_{ℓjk} (see Appendix B). The final estimation of the λ_{ℓjk}, and the associated significance testing, require only that the estimated conditional information matrix for the λ_{ℓjk} of (7), conditional on estimates of the g_{ik}, be positive definite.
The maximization of (7) provides consistent estimates of the λ_{ℓjk}, along with values of the GoM vectors (g_{i1}, ..., g_{iK}), i = 1, ..., I, which have empirical J-th order moments that are consistent estimates of the J-th order moments of the g_{ik}-distribution, F_g (Tolley & Manton, 1992, Theorem 4.1, p. 92). Consequently we are effectively using (7) to estimate the parameters of the marginal distribution in (5). Thus, estimability will still require that the degrees of freedom of the model (7) exceed the number of parameters in model (5), that is,

$$K^J > JK(K-1) + \frac{(J+K-1)!}{J!\,(K-1)!} - 1. \tag{8}$$
Since there are K categories and J judges, there are K^J ways in which the judges can classify a given object, and this must exceed the number of parameters that would be estimated in model (5); that is, the JK(K − 1) classification probabilities λ_{ℓjk} and the (J + K − 1)!/[J!(K − 1)!] − 1 factorial moments of F_g. Note that the conditional likelihood (7) contains a potentially larger number of I(K − 1) nuisance parameters (the g_{ik} subject to constraint (1)) than the likelihood in (5) (Tolley & Manton, 1992, pp. 91-92). Nevertheless, assuming (8) and a sufficiently large sample, the distribution of the conditional likelihood (7) is anchored by the sample moments of the g_{ik}-distribution, so that under general conditions it should be possible to obtain a reasonably accurate estimate of the λ_{ℓjk} (Manton et al., 1994, p. 67). General estimation problems of this type are also considered by Kiefer and Wolfowitz (1956), who would refer to the classification probabilities as "structural" parameters, in contrast to the "incidental" g_{ik} parameters.
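As an aside from us (not in the original article), condition (8) is easy to check numerically for a proposed design; a minimal Python sketch, with the function name ours:

```python
from math import comb

def satisfies_condition_8(J: int, K: int) -> bool:
    """Check (8): K**J must exceed JK(K-1) + (J+K-1)!/[J!(K-1)!] - 1."""
    n_lambda = J * K * (K - 1)              # classification probabilities, free under (4)
    n_moments = comb(J + K - 1, K - 1) - 1  # free J-th order factorial moments of F_g
    return K**J > n_lambda + n_moments

print(satisfies_condition_8(8, 4))  # True: 4**8 = 65536 > 96 + 164
print(comb(8 + 4 - 1, 4 - 1))       # 165 raw moments (164 free after the sum constraint)
```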
2.1. Sufficient Conditions for an Estimable Model
Let P be the I × JK (K nested within J) matrix of {p_{ijk}} defined in (3): P = [P_1, P_2, ..., P_J], so that P_j refers to the classification probabilities for judge j,

$$P_j = \begin{bmatrix} p_{1j1} & \cdots & p_{1jK} \\ \vdots & & \vdots \\ p_{Ij1} & \cdots & p_{IjK} \end{bmatrix}, \qquad 1 \le j \le J. \tag{9}$$
Given the classification probabilities for each object P = {p_{ijk}}, (3) can be written in matrix form as (Manton et al., 1994, p. 24):

$$P = G\Lambda, \tag{10}$$

where G is the I × K matrix of {g_{ik}}, and Λ is the K × JK matrix of {λ_{ℓjk}}: Λ = [Λ_1, Λ_2, ..., Λ_J],

$$\Lambda_j = \begin{bmatrix} \lambda_{1j1} & \cdots & \lambda_{1jK} \\ \vdots & & \vdots \\ \lambda_{Kj1} & \cdots & \lambda_{KjK} \end{bmatrix}, \qquad 1 \le j \le J. \tag{11}$$
If we begin only with the definition that p_{ijk} ≡ P[Y_{ijk} = 1], it can be shown that the representation in (10) is always possible (Woodbury, Manton, & Tolley, 1994, pp. 153-154; also Manton et al., 1994, pp. 25-27), and that Λ can be defined uniquely so that its columns are extreme profiles of a convex hull defined within the probability space generated by the profile vectors P (Woodbury et al., 1994, p. 154). When Λ is unique, model (5) is estimable (and identifiable) whenever (a) J > 2K, and (b) there is a nonsingular submatrix of E[P′P] of dimension K × K with no diagonal elements (Woodbury et al., 1994, pp. 152-166, especially Theorem 3, p. 160; also Manton et al., 1994, pp. 53-63). Condition (a) is generally regarded as more restrictive than (b), and in this case the number of factorial moments in (5) is particularly large, so that frequently the conditional model (7) is the only practical alternative. Note that Condition (b) is satisfied whenever one of the K × K off-diagonal submatrices of the form E(P′_{j_1}P_{j_2}), j_1 ≠ j_2, is nonsingular, where P_j is defined in (9). The ℓ-th column of this submatrix is proportional to the vector of K conditional probabilities

$$\left\{P[Y_{ij_1 1} = 1 \mid Y_{ij_2 \ell} = 1], \; \ldots, \; P[Y_{ij_1 K} = 1 \mid Y_{ij_2 \ell} = 1]\right\}$$
for the Y_{ijk} of (3). If we take a Bayesian perspective and imagine that each such column is drawn independently from a distribution with support on the (K − 1)-dimensional simplex, then E(P′_{j_1}P_{j_2}) would be of full rank with probability 1, since otherwise one of the columns represents a point in the (K − 2)-dimensional subspace of the original simplex defined by linear combinations of the other columns, and this can only happen with probability zero. (Similarly, E(P′_{j_1}P_{j_2}) would be nonsingular with probability 1 if it were randomly selected from a distribution that has continuous support on the (K² − 1)-dimensional simplex.)
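In practice, condition (b) can be probed directly from the observed classifications: because judges are conditionally independent given the GoM vector, the average outer product of two judges' indicator vectors estimates E[p_{ij_1 k} p_{ij_2 ℓ}]. A hedged sketch from us (NumPy assumed, names ours):

```python
import numpy as np

def crossproduct_is_nonsingular(x, j1, j2, K, tol=1e-8):
    """Estimate the K x K matrix E(P_{j1}' P_{j2}) (up to the factor I) from
    observed classifications and check that it has full rank K.

    x : (I, J) integer array; x[i, j] is the category judge j chose for object i.
    """
    I = x.shape[0]
    Y1 = np.eye(K)[x[:, j1]]   # (I, K) one-hot indicators for judge j1
    Y2 = np.eye(K)[x[:, j2]]   # (I, K) one-hot indicators for judge j2
    M = Y1.T @ Y2 / I          # (K, K): sample estimate of E[p_{ij1 k} p_{ij2 l}]
    return np.linalg.matrix_rank(M, tol=tol) == K
```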
2.2. Identifiability of A
Manton et al. (1994, pp. 24-25, 69-70) show that G and Λ are not unique if there is a nonsingular matrix A (different from the identity) such that the elements of G* = GA⁻¹ satisfy (1), and such that the elements of Λ* = AΛ satisfy (4). If such a nonsingular matrix A exists, then P of (10) can also be written as (Manton et al., 1994, p. 25)

$$P = G\Lambda = (GA^{-1})(A\Lambda) = G^*\Lambda^*, \tag{12}$$

so that G* and Λ* would provide alternative grade-of-membership values (g*_{ik}) and classification probabilities (λ*_{ℓjk}). Constraints (1) and (4) require that G* and the J submatrices of Λ* = [Λ*_1, ..., Λ*_J] be stochastic matrices (i.e., each of these matrices must have nonnegative elements and the elements of each row must add to 1). To summarize these conditions in matrix form, A must be a nonsingular K × K matrix such that G* = GA⁻¹ is a stochastic matrix, that is,

$$G^*\underline{1} = \underline{1}, \quad \text{and} \quad G^* = \{g^*_{ik} : g^*_{ik} \ge 0, \; 1 \le i \le I, \; 1 \le k \le K\} \tag{13}$$

(where 1 represents the appropriately dimensioned column vector of ones), and such that the submatrices of Λ* = [Λ*_1, ..., Λ*_J] = [AΛ_1, ..., AΛ_J] are each stochastic matrices,

$$\Lambda_j^*\underline{1} = \underline{1}, \quad \text{and} \quad \Lambda_j^* = \{\lambda^*_{\ell jk} : \lambda^*_{\ell jk} \ge 0, \; 1 \le \ell \le K, \; 1 \le k \le K\}, \quad 1 \le j \le J. \tag{14}$$

Thus, even if we knew the object classification probabilities P, there could be many possible matrices Λ* and corresponding matrices G* that would generate the same object classification probabilities P via (10) and still satisfy the constraints imposed by (13) and (14).
2.3. Sufficient Conditions for the Identifiability of A
Theorem 1. If G includes all K pure types, there is a unique Λ such that P = GΛ. Also, if there is a P̂ = ĜΛ̂ that maximizes (6), conditional on Ĝ, then Λ̂ is uniquely determined by P̂ and Ĝ, whenever Ĝ includes all K pure types. (This is proven in Appendix A.)
G will satisfy this condition whenever a sufficiently large sample is drawn from a population of objects that includes all K pure types. On the other hand, suppose we also want to consider classification probabilities Λ* that satisfy P = G*Λ*, for grade-of-membership matrices G* that may not include all K pure types. In this case, the unique estimand Λ that satisfies the condition of Theorem 1 is still important because it provides the greatest lower bounds of all possible correct classification probabilities {λ*_{kjk} : k = 1, ..., K, j = 1, ..., J} (i.e., the diagonal elements of the Λ*_j, 1 ≤ j ≤ J) that satisfy (10), subject to the constraints (13) and (14), whenever we are willing to assume a certain minimal level of classification accuracy. This result also provides a method for estimating the lower bound of the overall reliability of the I classifications. The following condition on classification accuracy for a given category is sufficient in this case.
C1. For some category k, k = 1, ..., K, there is at least one judge who correctly classifies a pure type with a probability greater than that judge's largest probability of misclassifying a pure type to the same category; that is, for some category k, 1 ≤ k ≤ K, there exists at least one judge j(k), j(k) = 1, ..., J, such that

$$\lambda_{kj(k)k} > \max\{\lambda_{\ell j(k)k} : \ell = 1, \ldots, K, \; \ell \ne k\}. \tag{15}$$

Condition C1 can be interpreted to mean that for category k, at least one judge's classifications are reliable enough to ensure that if a crisp (or pure) object has been classified to that category, then it is more likely to be from that category than from any other single category. We have the following theorem.
Theorem 2. Assume that the object classification probabilities P can be represented as P = GΛ, subject to the constraints of (1) and (4), where G consists of K pure types. Also assume Λ satisfies condition C1 for some category k, k = 1, ..., K, and let j(k), j(k) = 1, ..., J, represent any judge who fulfills the accuracy constraint of (15) for category k. Under these conditions, the diagonal element λ_{kj(k)k} of Λ is the greatest lower bound of the correct classification probabilities λ*_{kj(k)k} from all Λ* that satisfy C1 and for which P = G*Λ*, even when the corresponding G* does not include all K pure types. Also, if C1 is satisfied for each category k, 1 ≤ k ≤ K, then there is a unique Λ with diagonal elements λ_{1j(1)1}, λ_{2j(2)2}, ..., λ_{Kj(K)K}. (This is proven in Appendix A.)
The following corollary is an immediate consequence of this theorem and can be used in most applications where the judges are experts.

Corollary 1. If, in addition to the conditions of Theorem 2, the constraint (15) is satisfied by Λ for all judges j (j = 1, ..., J) and categories k (k = 1, ..., K), then each diagonal element λ_{kjk} of Λ is the greatest lower bound among all possible correct classification probabilities λ*_{kjk} that come from matrices Λ* that satisfy (15) for all judges and categories, and for which P = G*Λ*, even when G* does not include all K pure types.
Theorems 1 and 2 do implicitly assume that model (5) is estimable, and the uniqueness provided by these theorems is meaningful because the Ĝ and Λ̂ that maximize (7) are consistent under very general conditions (Tolley & Manton, 1992, Theorem 4.1, p. 92) whenever Λ has been sufficiently constrained so that it is identifiable from P; Ĝ is consistent in the sense that its empirical J-th order moments converge to those of the g_{ik}-distribution, F_g, from which the row vectors (g_{i1}, ..., g_{iK}), i = 1, ..., I, of G are drawn. Also, the asymptotic joint distribution of Λ̂ and the estimated moments of F_g allows approximate chi-square tests based on likelihood ratio statistics (Manton et al., 1994, pp. 75-82). For these results, F_g need not be continuous, but must have at most a countable number of points with positive probability.

Theorems 1 and 2 are especially useful in applications because it is possible to check directly whether the estimates of G and Λ meet the conditions. If the sample includes all K pure types, the two basic assumptions of Theorem 2 are easily met whenever there is at least one expert judge for each category. On the other hand, suppose we believe that all pure types are represented in the sample, but we obtain an estimate Ĝ⁽⁰⁾ of G that does not include all pure types (and assume Λ̂⁽⁰⁾ represents the corresponding estimate of Λ). Then a natural transformation would be to postmultiply Ĝ⁽⁰⁾ by a matrix B⁻¹, so that Ĝ = Ĝ⁽⁰⁾B⁻¹ does include all pure types, and to define Λ̂ as Λ̂ = BΛ̂⁽⁰⁾ (B is then a specific stochastic version of the matrix A of (12)). A transformation of this type is not always possible when K > 2. When K = 2, B is the matrix formed from the two rows of Ĝ⁽⁰⁾ that have the largest ĝ⁽⁰⁾_{ik} values for each category k,
$$B = \begin{bmatrix} \hat{g}^{(0)}_1 \\ \hat{g}^{(0)}_2 \end{bmatrix}, \tag{16}$$

where

$$\hat{g}^{(0)}_1 = \left[\max\{\hat{g}^{(0)}_{i1} : 1 \le i \le I\}, \;\; 1 - \max\{\hat{g}^{(0)}_{i1} : 1 \le i \le I\}\right],$$
$$\hat{g}^{(0)}_2 = \left[1 - \max\{\hat{g}^{(0)}_{i2} : 1 \le i \le I\}, \;\; \max\{\hat{g}^{(0)}_{i2} : 1 \le i \le I\}\right]. \tag{17}$$

In this case the rows of B are simply the rows of Ĝ⁽⁰⁾ that are closest to the pure types.
2.4. An Illustration of the Lower Bound Property Under Condition C1
For a simple illustration of the lower bound property of the correct classification estimates in Theorem 2, note that the matrix B in (16) is generally of the form

$$B = \begin{bmatrix} p & 1-p \\ 1-q & q \end{bmatrix}, \quad 0 < p \le 1, \; 0 < q \le 1, \tag{16'}$$

so that if either p or q is not 1, the transformed matrix Λ̂ = BΛ̂⁽⁰⁾ has elements

$$\hat{\lambda}_{1j1} = p\,\hat{\lambda}^{(0)}_{1j1} + (1-p)\,\hat{\lambda}^{(0)}_{2j1},$$
$$\hat{\lambda}_{2j2} = (1-q)\,\hat{\lambda}^{(0)}_{1j2} + q\,\hat{\lambda}^{(0)}_{2j2}. \tag{18}$$

Thus, the initial correct classification estimates λ̂⁽⁰⁾_{kjk}, k = 1, 2, on the right side of (18) correspond to grade-of-membership values Ĝ⁽⁰⁾ such that the "closest to pure" types are the two rows of B referred to in (17) (where p and q are presumably close to 1). By transforming these initial estimates to the estimates λ̂_{kjk}, k = 1, 2, on the left side of (18), we always obtain correct classification probability estimates that are less than or equal to the initial estimates λ̂⁽⁰⁾_{kjk}, because the right side of each equation in (18) is a weighted average of the initial correct classification probability estimate λ̂⁽⁰⁾_{kjk} with an error probability estimate that must be smaller than λ̂⁽⁰⁾_{kjk} under condition C1. The new estimates λ̂_{kjk} (on the left side of (18)) correspond to a G matrix that includes pure types in both categories. If Ĝ⁽⁰⁾Λ̂⁽⁰⁾ = ĜΛ̂ maximizes (7), the λ̂_{kjk} are estimates of the greatest lower bounds of the correct classification probabilities and the corresponding matrix Λ̂ is an estimate of the unique estimand Λ of Theorem 2.
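A small numeric illustration of this lower bound direction, with values invented by us for K = 2 and one judge:

```python
import numpy as np

# Hypothetical initial estimates for one judge: rows are pure types,
# columns are assigned categories (diagonal = correct classification).
Lam0_j = np.array([[0.90, 0.10],
                   [0.20, 0.80]])

p, q = 0.95, 0.90                 # largest memberships in G-hat(0), as in (17)
B = np.array([[p, 1 - p],
              [1 - q, q]])        # the matrix of (16')

Lam_j = B @ Lam0_j                # the transformation of (18)
print(Lam_j[0, 0])                # 0.865, at or below the initial 0.90
print(Lam_j[1, 1])                # 0.730, at or below the initial 0.80
```

Because each transformed diagonal is a weighted average of an initial correct probability with a smaller error probability, it can only move downward, as the text explains.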
2.5. Interpreting Condition C1
The classification accuracy condition C1 has a simple interpretation if we adopt terminology that is typically used in diagnostic testing. Define a judge's "sensitivity" for category k as the probability of correct classification (λ_{kjk}), and define the minimum "specificity" of judge j for category k as the minimum probability, across all categories ℓ, ℓ ≠ k, that the judge will classify an object from ℓ to some other category besides k (which is one minus the maximum probability of misclassifying an object from another category ℓ to category k, 1 − max{λ_{ℓjk} : ℓ = 1, ..., K, ℓ ≠ k}). Then condition C1 is equivalent to requiring that for a given category k, there is a judge with a minimum specificity that exceeds one minus that judge's sensitivity for category k. In summary, if C1* represents the assumption that C1 is satisfied for all categories and all judges, then C1* may also be described as:

C1*. For all j and k (j = 1, ..., J, k = 1, ..., K), {judge j's minimum specificity for category k} = 1 − max{λ_{ℓjk} : ℓ = 1, ..., K, ℓ ≠ k} > 1 − λ_{kjk} = 1 − {judge j's sensitivity for category k}.
Consequently, C1* is satisfied whenever all judge classifications are sufficiently sensitive or specific for each category. This is another way of stating the condition of the Corollary. This condition differs from the following condition C2, which is typically assumed when reliability estimates are calculated using classification probabilities (see Cooil & Rust, 1995, p. 202).

C2. Each judge correctly classifies each pure type more often than she/he incorrectly classifies it to any other single category: for each j and k (1 ≤ j ≤ J, 1 ≤ k ≤ K),

$$\lambda_{kjk} > \max\{\lambda_{kj\ell} : \ell = 1, \ldots, K, \; \ell \ne k\}.$$
In contrast to condition C1*, condition C2 requires that judges have sufficient sensitivity for each category. Condition C2 is not assumed in the Corollary to Theorem 2 (although it is also a reasonable minimal qualifying condition when selecting judges) but is equivalent to condition C1* when there are only two categories (K = 2). Technically, when K > 2, C1* is not equivalent to C2, nor does either condition imply the other. But if either C1* or C2 is satisfied, and not both, it is because judges tend to make specific types of misclassifications more frequently than others. For example, C1* and C2 are equivalent if we make the additional assumption that all types of misclassification errors occur less frequently than they would if all judges were classifying each object randomly; that is, C1* and C2 are equivalent whenever

$$\lambda_{\ell jk} < \frac{1}{K}, \quad \text{for all } j, \; j = 1, \ldots, J, \text{ and for all } k \text{ and } \ell, \; \ell \ne k.$$
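For completeness, here is a small sketch from us of how C1* and C2 could be verified for an estimated probability array, using the layout Lam[j, l, k] = λ_{ℓjk} assumed in the earlier sketches:

```python
import numpy as np

def satisfies_C1_star(Lam):
    """C1*: for every judge j and category k, lambda_{kjk} exceeds every
    misclassification probability lambda_{ljk} into category k."""
    J, K, _ = Lam.shape
    return all(Lam[j, k, k] > max(Lam[j, l, k] for l in range(K) if l != k)
               for j in range(J) for k in range(K))

def satisfies_C2(Lam):
    """C2: every judge classifies each pure type k to k more often than
    to any other single category."""
    J, K, _ = Lam.shape
    return all(Lam[j, k, k] > max(Lam[j, k, l] for l in range(K) if l != k)
               for j in range(J) for k in range(K))
```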
3. Simulated Illustration
We consider a simulated data set where each of 8 judges classifies 800 objects into 4 categories (I = 800, J = 8, K = 4 in (7)). Half (400) of these objects have GoM vectors, [g_{i1}, g_{i2}, g_{i3}, g_{i4}], that are drawn randomly from the flat Dirichlet distribution with density

$$f_{G\text{-}MIX}(x_1, x_2, x_3, x_4) = \Gamma\!\left(\sum_{k=1}^{4}\alpha_k\right)\prod_{k=1}^{4}\frac{x_k^{\alpha_k - 1}}{\Gamma(\alpha_k)} \tag{19}$$

$$= 3!, \quad \text{where } \alpha_k = 1, \; 1 \le k \le 4, \tag{20}$$

for 0 ≤ x_k ≤ 1, 1 ≤ k ≤ 4, such that Σ_{k=1}^{4} x_k = 1. The other 400 GoM values are drawn from a symmetric "bowl-shaped" Dirichlet that puts most of its mass near the four pure types,

$$f_{NC}(x_1, x_2, x_3, x_4) = \Gamma(0.2)\prod_{k=1}^{4}\frac{x_k^{-0.95}}{\Gamma(0.05)}. \tag{21}$$
Given this equal mixture of distributions (20) and (21), the expectation is that for each of the 4 categories, 9% of the 800 objects will be "nearly crisp" in the sense that g_{ik} ≥ 0.90 for that category k, 1 ≤ k ≤ 4.
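A minimal sketch from us (NumPy assumed, names ours) of this sampling scheme:

```python
import numpy as np

rng = np.random.default_rng(7)
K, I = 4, 800

# 400 true mixtures from the flat Dirichlet (alpha_k = 1), as in (19)-(20).
flat = rng.dirichlet(np.ones(K), size=I // 2)

# 400 from the "bowl-shaped" Dirichlet (alpha_k = 0.05), as in (21);
# most of its mass lies near the four vertices (nearly pure types).
bowl = rng.dirichlet(np.full(K, 0.05), size=I // 2)

G = np.vstack([flat, bowl])
rng.shuffle(G)

# For each category, roughly 9% of objects should have g_ik >= 0.9.
print((G[:, 0] >= 0.9).mean())
```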
3.1. The Judges
To make the estimation of classification probabilities more interesting, we used three types of judges: (a) "experts" (λ_{kjk} ≥ 0.7 for all k, 1 ≤ k ≤ 4; see the first 4 judges in Table 1); (b) "specialists" who only classify pure types from some categories with high accuracy (see Judges 5 and 6 in Table 1, where λ_{kjk} is either 0.5 or 0.9); and (c) judges who do not reliably classify any pure type (λ_{kjk} < 0.7 for all k, 1 ≤ k ≤ 4; see Judges 7 and 8 in Table 1, for whom λ_{kjk} = 0.65, 1 ≤ k ≤ 4). In each case, judges also make classification errors randomly, so that

$$\lambda_{kj\ell} = (1 - \lambda_{kjk})/(K - 1), \quad \ell \ne k.$$

Thus, all judges satisfy the conditions C1 (for all categories) and C2. The data, y_{ijk} of (2), are generated as multinomial realizations using the p_{ijk} of (3), which are calculated from the λ_{kjℓ} and the randomly generated g_{ik}.
TABLE 1.
The Actual and Estimated Classification Probabilities Based on 800 Objects.(1)

                     Correct Classification Probabilities (%)(2)               Selected Misclassification Probabilities (%)
Judge           100% x λ_{1j1}   λ_{2j2}        λ_{3j3}        λ_{4j4}         λ_{1j2}          λ_{3j1}

Experts
  Judge 1       90 (89, 92)     90 (88, 93)    90 (89, 93)    90 (90, 93)     3.3 (2.2, 4.1)   3.3 (2.3, 4.8)
  Judge 2       70 (70, 77)     70 (74, 80)    70 (71, 78)    70 (68, 78)     10 (6.0, 10)     10 (7.8, 12)
  Judge 3       90 (87, 91)     90 (88, 91)    70 (72, 79)    70 (73, 76)     3.3 (4.6, 6.6)   10 (6.3, 11)
  Judge 4       70 (75, 80)     70 (74, 81)    90 (89, 94)    90 (86, 91)     10 (3.7, 8.6)    3.3 (1.5, 3.1)

Specialists
  Judge 5       90 (88, 91)     90 (87, 91)    50 (46, 58)    50 (48, 57)     3.3 (2.6, 5.9)   17 (16, 21)
  Judge 6       50 (49, 58)     50 (50, 59)    90 (87, 91)    90 (88, 92)     17 (11, 18)      3.3 (2.0, 4.8)

Not Reliable
  Judge 7       65 (67, 71)     65 (64, 69)    65 (63, 68)    65 (66, 71)     12 (8.9, 13)     12 (9.9, 14)
  Judge 8       65 (64, 72)     65 (66, 77)    65 (70, 74)    65 (65, 71)     12 (10, 14)      12 (6.7, 12)

(1) Half of this sample is drawn from the flat Dirichlet of (20), and the other half from the "bowl-shaped" Dirichlet of (21). For each category, the expectation is that 9% of the sample is composed of objects that are at least 90% from that category (g_{ik} ≥ 0.9).
(2) The parenthetical values are the 1st and 3rd quartiles of the distribution of estimates from 10 replications.
"specialists" who only classify pure types from some categories with high accuracy (see Judges
5 and 6 in Table 1, where )~kjk is either 0.5 or 0.9); and (c) judges who do not reliably classify any
pure type Gkjk < 0.7 for all k, 1 < k < 4; see Judges 7 and 8 in Table 1, for whom )~kjk = 0.65,
1 < k < 4). In each case, judges also make classification errors randomly so that,
Lkje = [1 -- Lkjk]/(K -- 1), £ ¢ k.
Thus, all judges satisfy the conditions C1 (for all categories) and C2. The data, Yijk (2), are
generated as multinomial realizations using the Pijk (3), that are calculated from the Lkje and the
randomly generated gik.
3.2. Starting Values and Estimation
Procedures for selecting plausible starting values are of particular importance because there
are frequently local maxima at the boundaries of the parameter space for A. To find global estimates of A in the interior of the parameter space, it often helps to identify those objects that
are closest to pure types when selecting starting values. A simple, data-driven procedure for
estimating A (including the selection of starting values) is outlined in Appendix B.
Table 1 summarizes the 1st and 3rd quartiles of the distribution of estimates from 10 replications, where in each replication the judges classify a new sample of 800 objects (400 of which are drawn from each of the two Dirichlet distributions of (20) and (21)). The quartiles illustrate the typical accuracy of the estimates. Here the average correct classification probability in each category is nearly 74%, and certainly, as this average decreases, a larger sample would be needed to achieve the same estimation accuracy. The median absolute relative error of the correct classification estimates (across replications) increases as the estimand, λ_{kjk}, decreases toward 0.5 and increases as the error level, Σ_{ℓ≠k} λ_{ℓjk}, increases. In this small study, the mean absolute relative error of the estimates went from a high of 12%, when λ_{kjk} = 0.5 and Σ_{ℓ≠k} λ_{ℓjk} = 0.23, to a low of 1.7%, when λ_{kjk} = 0.9 and Σ_{ℓ≠k} λ_{ℓjk} = 0.1 (median absolute relative error ranged from 11 to 1.7%). Nevertheless, a sample size of 800 is a relatively small sample for an estimation problem of this size: here there are 96 (= JK(K − 1)) classification probabilities to estimate, only 36% of these objects are crisp enough so that g_{ik} ≥ 0.9 for some category k, k = 1, ..., 4, and the estimates themselves are based entirely on the relatively crude Bernoulli classifications of (2) that are made by the 8 judges.
The inter-quartile ranges of Table 1 also illustrate that estimates are positively biased when the estimand is greater than 0.5 and negatively biased when λ_{kjℓ} < 0.5. This attenuation makes judges 7 and 8 appear more reliable than they really are. Still, the estimates do generally reveal the categories in which the "specialists" and "experts" are most accurate, and they also correctly identify the most reliable judges. Note that (15) was not imposed as a constraint in the estimation procedure. The attenuation toward 0 and 1 is typical of latent class estimates of classification probabilities and is studied more thoroughly in section 4, where we consider varying numbers of categories and judges, and a much larger number of replications.
4. Simulation Study
The purpose of these simulations is to study the relative accuracy of estimators of the correct classification probabilities, λ_{kjk}, under a variety of conditions, where in every case the g_{ik} are drawn from distributions where true crisp values never actually occur, so that even under the most favorable mixture conditions described below, a fixed sample G can only approximately satisfy the constraint in the theorems of section 2. We consider cases where there are 2, 3, or 4 categories, varying numbers of judges, different proportions of nearly crisp objects, and different correct classification probability levels. We assume that the judges make errors randomly (so that λ_{kjℓ} = λ_{kjℓ′} whenever ℓ ≠ k and ℓ′ ≠ k), but still consider the relatively general case where judges make classifications with accuracy that may change by judge and category.
4.1. Simulation Design
Each sample is constructed so that either 50, 70, or 90% of the I objects have GoM vectors, [g_{i1}, ..., g_{iK}], that are drawn from the symmetric and slightly peaked Dirichlet,

$$f_{mixture}(x_1, \ldots, x_K) = \Gamma(2K)\prod_{k=1}^{K} x_k, \tag{22}$$

and the remaining portion of GoM vectors (50, 30, or 10%, respectively) are drawn from the nearly crisp Dirichlet,

$$f_{near\,crisp}(x_1, \ldots, x_K) = \Gamma\!\left(\sum_{k=1}^{K}\alpha_k\right)\prod_{k=1}^{K}\frac{x_k^{\alpha_k - 1}}{\Gamma(\alpha_k)} = \frac{\Gamma(10(K-1))}{\Gamma(9(K-1))}\, x_\ell^{\,9K-10}, \tag{23}$$

where α_ℓ = 9(K − 1) and α_k = 1, when k ≠ ℓ. In (23), ℓ takes on the value of each category (1, ..., K) for an equal number of objects, so that 1/K of this nearly crisp portion of the sample is drawn from each category. Even in the case where 50% of the sample comes from the nearly crisp distributions of (23), on average only from 28% (when K = 4) to 31% (when K = 2) of the entire sample will consist of objects that are more than 90% from one category.
As in section 3, the data (the y_{ijk} of (2)) are generated as multinomial realizations of the p_{ijk} of (3), which are calculated from selected values of λ_{kjℓ} and the randomly generated values of g_{ik} (as described above). To ensure independent assessments of accuracy, we evaluate the estimates of the λ_{kjk} for only one category (k) in each sample of the judge classifications for I objects.
4.2. Simulation Results
Figure 1 shows the absolute relative error (ARE = {|λ̂_{kjk} − λ_{kjk}|/λ_{kjk}} × 100%) of estimates for the correct classification probabilities in the worst cases, where the samples are either 70 or 90% true mixtures from (22) (i.e., the remaining 30 or 10%, respectively, were nearly crisp from (23)), and the corresponding correct classification probability, 100% × λ_{kjk}, is either 60% or 90%. At these high mixture levels, the AREs are highest when the correct classification level is 60% and lowest at 90%. The distribution of ARE, represented by each box plot, is based on 192 replications, where the total random error level, Σ_{ℓ≠k} λ_{ℓjk} × 100%, is uniformly distributed across the values 10, 20, 30, and 40% (48 replications each). Each replication is based on the classification of 200K objects (i.e., the sample size, I, is 400, 600, or 800 for K = 2, 3, or 4, respectively), where the overall average correct classification probability (across judges), Σ_j λ_{kjk}/J, is 0.75 in each category k, k = 1, ..., K. By design, the expected number of objects per category is constant (i.e., E[Σ_i g_{ik}/I] is the same for each k, k = 1, ..., K). This is analogous to the typical design for empirical studies of classification processes where there is a constant number of objects per category. In this case, however, the objects are not crisp.

Typically, one should only use the conditional model when the number of judges exceeds the number of categories, J > K (ideally, J > 1.5K), but we have included cases where J = 4 when K = 4 for illustration, and in these cases the median ARE ranges from 3.4 to 35%. Otherwise median ARE is always less than 20%, and usually less than 10%.
The cases in Figure 1 represent the least auspicious circumstances for accurate estimation of the λ_{kjk}. Median AREs tend to be substantially smaller when the mixture level is 50%. In this case, median ARE is uniformly less than 10% when J > K, except when K = 2, J = 4, and the correct classification level is 0.6 or 0.7 (here median AREs are 18 and 14%, respectively).

FIGURE 1. Each box plot represents the distribution of absolute relative error (ARE) in the estimation of the correct classification probability λ_{kjk} for the indicated number of categories (2, 3, or 4), number of judges (4, 6, or 8), actual value of 100% × λ_{kjk}, and mixture level (percent of sample drawn from (22); the remaining proportion is drawn from (23)). The 95% confidence interval for median ARE is indicated within each box. Each empirical distribution is based on 192 replications, where the total error level (100% × Σ_{ℓ≠k} λ_{ℓjk}) is distributed uniformly across the values 10, 20, 30, and 40% (48 replications each).

FIGURE 2. For each combination of mixture level (50, 70, or 90%), number of categories (2, 3, or 4), and number of judges (4, 6, or 8), the plot connects the median relative error when the correct probability levels are 0.6, 0.7, 0.8, and 0.9, respectively. Each point is the median of 192 observations (48 at each total error level, Σ_{ℓ≠k} λ_{ℓjk} = 0.1, 0.2, 0.3, and 0.4).
Figure 2 shows how relative error (= {[λ̂_{kjk} − λ_{kjk}]/λ_{kjk}} × 100%) varies by data quality and the dimensions of the problem. Median relative error (MRE) provides a way of studying the typical bias of these estimates. At each combination of mixture level, number of categories, and number of judges, the four points plotted in Figure 2 are the MRE when λ_{kjk} is 0.6, 0.7, 0.8, and 0.9, respectively (going from left to right). Within each combination of mixture level and number of categories, these plots show that the MRE generally decreases as the level of the estimand (λ_{kjk}) increases and as the number of judges increases. MRE also generally increases as the total error level (Σ_{ℓ≠k} λ_{ℓjk}) increases (not shown). Overall, the MRE runs from 7.5%, when λ_{kjk} = 0.6, to −3.6%, when λ_{kjk} = 0.9. This positive bias in the estimates of the lower correct classification probabilities was also noted in the empirical example of section 3.
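The two error summaries used in this section are simple to compute; a sketch from us (names ours):

```python
import numpy as np

def are_percent(lam_hat, lam):
    """Absolute relative error of one estimate, in percent."""
    return 100.0 * abs(lam_hat - lam) / lam

def median_relative_error(lam_hats, lam):
    """Signed median relative error (MRE, in percent) across replications."""
    return float(np.median([100.0 * (h - lam) / lam for h in lam_hats]))
```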
4.3. The Effect of Average Correct Classification Rate
In the simulations summarized above, the average correct classification probability within each category was held constant at 0.75. To study the effect that decreasing this average would have on estimation accuracy, we did additional simulations with the samples consisting of 50% mixtures (i.e., 50% of each sample was drawn from the symmetric Dirichlet of (22) and an equal proportion of the remaining 50% drawn from each of the nearly crisp distributions of (23), for ℓ = 1, ..., K), where the average correct classification rate was set at 0.65. We then studied the change in estimation accuracy at the common levels of λ_{kjk} and Σ_{ℓ≠k} λ_{ℓjk} that were used in both sets of simulations (λ_{kjk} = 0.6, 0.7, 0.8 and Σ_{ℓ≠k} λ_{ℓjk} = 0.2, 0.3, 0.4).
The change in ARE depended primarily on the values of J and K. In the 3-category, 8-judge case, median ARE actually decreased by 1.6% in absolute terms, from 6.0% when the average λ_{kjk} was 0.75 to 4.4% when the average λ_{kjk} was 0.65! Otherwise median ARE increased from 2.6 to 12.3% in absolute terms, depending on J and K, as the average correct classification level fell from 0.75 to 0.65, and at the lower average correct classification level (0.65), the largest median ARE was 17.0% (this occurred in the 4-category, 8-judge case).
5. Conclusion
The conditional GoM model provides a practical method of estimating classification probabilities when sample objects consist of both mixtures and pure types. Theorem 1 shows that if G includes all possible pure types, the classification matrix Λ is unique. Furthermore, according to Theorem 2, if we are also willing to assume that certain judges satisfy a minimum level of classification accuracy for a particular category k (condition C1), the estimate of the probability that such a judge will correctly classify an object from category k is also the greatest lower bound among all possible Λ matrices that satisfy C1 (including those for which G does not include all K pure types). Whenever the conditions of the Corollary of Theorem 2 are met, lower bound measures of data reliability (or test reliability when the "judges" are a group of test items) can be calculated directly from the estimates of Λ (e.g., see Cooil & Rust, 1995: estimates of lower bounds for reliability follow immediately from expressions (22)-(23), p. 204, using "random error model 1" as shown in Table 2, p. 214).
In some applications, the "judges" may actually be identifying latent categories that do not
directly correspond to the classification categories. For example, consider a psychological testing
framework, where test items play the role of judges that "classify" subjects (or patients) in terms
of several preselected categories. What if the test items really only identify latent categories?
The classification model (7) is directly applicable when a unique subgroup of latent categories
is mapped into each of the K classification categories (so that the L latent categories are in
mutually exclusive subgroups that are each subsets of a different classification category). On the
other hand, whenever a subgroup of more than one classification category corresponds to a single
latent category, it would be necessary to redefine that subgroup as a single classification category
(if it can be identified beforehand) to obtain the crisp objects needed in Theorems 1 and 2. If
the latent dimensions and classification categories are not directly related in one of these ways,
model (7) is at best an approximation of a more complicated process, and a more general GoM
model, that explicitly accommodates latent categories, may be needed (Manton et al., 1994).
In section 3, the conditional GoM model was used to estimate the classification probabilities of 8 judges across 4 categories. The unconditional model would require the joint estimation of Λ and 164 additional parameters in this case, which may not be practical without a considerably larger sample. Another approach would be to assume a specific parametric family for the g_{ik}-distribution, but this would also become impractical as we try to anticipate mixture distributions that could be significantly more complicated than those considered in section 4. Thus, the conditional GoM model makes it possible to consider a wider range of applications than is possible with the unconditional model (or even specific parametric alternatives), including those in which a relatively large number of judges and/or categories are used. These applications include cases where it is necessary to screen a large number of judges to determine a subgroup of experts, and situations where a large number of categories and judges are necessary to gauge the complexity of the objects. The "judges" may be test items that are used to evaluate psychological or psychiatric subjects, or the "judges" may even be actual test-takers, patients, or customers, who are asked to evaluate test procedures, patient care, or the quality of products or services, respectively. The
simulations in section 4 indicate that one can generally expect relatively accurate estimates of the
classification probabilities (or, under condition C1 of section 2, accurate estimates of the greatest
lower bounds of the correct classification probabilities, across all possible grade-of-membership
values, G) when there are at least an average of 200 objects per category, even when there are
very few crisp objects. When the average correct classification level was 0.75, the median ARE was usually well below 10%, which indicates that the median ARE would generally be well below 5% if there were as many as a thousand objects per category (on average), even if only
10% of those objects were nearly crisp (and among these nearly crisp objects, nearly half could
be less than 90% from one category). To attain the same accuracy, larger samples will typically
be required when the average correct classification level is lower, but even smaller samples will
suffice as the proportion of nearly crisp objects increases.
Appendix A
Proofs of the Theorems
Let A be a nonsingular matrix such that G* = GA⁻¹ satisfies (13) and such that Λ* = AΛ satisfies (14). (P and G must have K-dimensional support, so it suffices to consider nonsingular K × K matrices A.) First we show that both A and A⁻¹ must have eigenvector 1, with corresponding eigenvalue 1. Note that for any j, j = 1, ..., J,

$$A\Lambda_j \underline{1} = \underline{1} \tag{A1}$$

because Λ*_j satisfies (14) and Λ*_j = AΛ_j. But since

$$\Lambda_j \underline{1} = \underline{1} \tag{A2}$$

(i.e., the elements of each row of Λ_j must add to 1 since it is a stochastic matrix by definition),

$$A\underline{1} = \underline{1}, \tag{A3}$$

by substitution of (A2) into (A1). Equation (A3) implies

$$A^{-1}\underline{1} = \underline{1}. \tag{A4}$$
The proofs of both theorems will use the following lemma.
Lemma 1. If G includes all pure types, then A⁻¹ is a stochastic matrix.

Remark 1. Recall that a stochastic matrix is defined as any matrix with only nonnegative elements such that the elements in each row add to 1. The fact that A and A⁻¹ must have row elements that sum to 1 is implied by (A3) and (A4), respectively.
Proof. It remains to show that the elements of A⁻¹ are all nonnegative. Let a^{kℓ} denote the (k, ℓ)-th element of A⁻¹, and assume it is negative (i.e., a^{kℓ} < 0). Since G is assumed to include all pure types, there exists a row i such that

$$g_{ik} = 1, \quad g_{i\ell} = 0, \; \ell \ne k. \tag{A5}$$

Now consider the (i, ℓ)-th element g*_{iℓ} of G*, where G* = GA⁻¹, so that

$$g^*_{i\ell} = g_{ik}\, a^{k\ell} + \sum_{r \ne k} g_{ir}\, a^{r\ell} = a^{k\ell}, \tag{A6}$$

where the last equality follows from (A5). But since a^{kℓ} < 0 by assumption, we have that g*_{iℓ} < 0 by (A6). This contradicts the assumption that A is such that G* = GA⁻¹ satisfies (13), and establishes that all elements of A⁻¹ must be nonnegative. □
Proof of Theorem 1. By assumption G includes all K pure types. Suppose there is another estimate G*, G* = GA⁻¹, that also includes all K pure types. Then since G = G*A, the matrix A must also be stochastic by the lemma (following the same proof using A in place of A⁻¹). But the fact that A and A⁻¹ are both stochastic implies that A is the identity matrix. Thus, Λ is unique. Finally, by exactly the same argument (with G and Λ replaced by Ĝ and Λ̂, respectively), if there is a P̂ = ĜA⁻¹AΛ̂ that maximizes (6), conditional on Ĝ, then A is the identity, and Λ̂ is uniquely determined by P̂ and Ĝ, whenever Ĝ includes all K pure types. □
Proof of Theorem 2. Suppose there is another Λ* = AΛ and corresponding G* = GA⁻¹, such that Λ* also satisfies condition C1 and P = G*Λ*. By assumption, G includes all K pure types, so A⁻¹ must be a stochastic matrix by the lemma. Let a^{kℓ} denote the (k, ℓ)-th element of A⁻¹. Then Λ = A⁻¹Λ*, and for each category k, k = 1, ..., K, there is a judge j(k), such that

$$\lambda_{kj(k)k} = \sum_{\ell=1}^{K} a^{k\ell}\,\lambda^*_{\ell j(k)k} \le \left[\max\{\lambda^*_{\ell j(k)k} : \ell = 1, \ldots, K\}\right] \times \left[\sum_{\ell=1}^{K} a^{k\ell}\right] = \lambda^*_{kj(k)k}, \tag{A7}$$

where we use the fact that A⁻¹ is stochastic, so that Σ_{ℓ=1}^{K} a^{kℓ} = 1, a^{kℓ} ≥ 0, and the assumption that Λ* satisfies condition C1 for the same judge j(k) and category k, so that max{λ*_{ℓj(k)k} : ℓ = 1, ..., K} = λ*_{kj(k)k}. It follows that the diagonal element λ_{kj(k)k} is the greatest lower bound among all possible Λ* that satisfy condition C1. Furthermore, if C1 is satisfied for each category k, k = 1, ..., K, then Λ is the unique estimate with the corresponding diagonal elements λ_{1j(1)1}, λ_{2j(2)2}, ..., λ_{Kj(K)K}, if we can show that a strict inequality must occur in (A7) for at least one category k whenever A is not the identity matrix. But a strict inequality does occur in (A7), because A⁻¹ is stochastic and Λ* satisfies condition C1, where a strict inequality is assumed in (15): that is, if A is not the identity matrix, there must be at least one category k such that a^{kk} ≠ 1, so that

$$\lambda_{kj(k)k} = \sum_{\ell=1}^{K} a^{k\ell}\,\lambda^*_{\ell j(k)k} < \lambda^*_{kj(k)k} \times \left[\sum_{\ell=1}^{K} a^{k\ell}\right] = \lambda^*_{kj(k)k},$$

since λ*_{kj(k)k} > max{λ*_{ℓj(k)k} : ℓ ≠ k} and Σ_{ℓ=1}^{K} a^{kℓ} = 1, a^{kℓ} ≥ 0. Otherwise a^{kk} = 1 for all k, k = 1, ..., K, which would imply that A⁻¹ is the identity (and therefore that Λ = Λ*). □
Appendix B
Selecting Starting Values and the Profile Likelihood Algorithm
Outline of the Estimation Procedure Used in Sections 3 and 4
This procedure consists of: (a) the selection of starting values, (b) estimation of G and Λ by an iterative profile likelihood (PL) algorithm, and (c) final estimation of Λ by maximizing (7), conditional on an estimate Ĝ taken from the penultimate step of the PL algorithm. For step (c), we used the GAUSS constrained maximum likelihood module (Aptech Systems, 1995), which also provided estimates of the covariances of the λ_{kjℓ} estimates. A GAUSS program for the entire procedure is available from the first author.
Starting Values
Initially, when the classification probabilities, Λ, are unknown, crisp objects (i.e., pure types) are indicated when a relatively large plurality of judges choose a single category and relatively few judges choose any other single category. If M_{i1}(k) is the largest number of judges that agree that object i should be classified to some category k, and if M_{i2}(k) is the second largest number of judges that agree on any category other than k, k = 1, ..., K, then the most probable crisp values are characterized by large values of M_{i1}(k) − M_{i2}(k). To derive preliminary benchmarks for those values of M_{i1}(k) − M_{i2}(k) that might indicate crisp objects, we considered a
rather extreme scenario of low judgment accuracy, where (a) judges make errors randomly, (b) the correct classification probability λ_{kjk} is at least as large as the misclassification probability λ_{kjℓ}, and (c) λ_{kjk} is at most twice the misclassification probability when K = 2 or 3, and when K > 3, it is at most 0.5; this implies λ_{kjk} ≤ max{2/(K + 1), 0.5}, and that for a crisp object i from category k,

$$E[M_{i1}(k) - M_{i2}(k)] \le \mathrm{CEIL}\left[\max\{J/(K+1),\; J(K-2)/(2(K-1))\}\right], \tag{B1}$$

where CEIL[x] represents the smallest integer greater than or equal to x.
Consequently, if the observed value of M_{i1}(k) − M_{i2}(k) is at or above the bound on the right side of (B1), this would indicate that object i is crisp even if judge accuracy were relatively low (and if judges are more accurate, even larger values of M_{i1}(k) − M_{i2}(k) would be expected for crisp objects). Here is a relatively simple procedure for selecting preliminary starting values for the classification probabilities, Λ, and grade-of-membership values, G; a code sketch follows below.
1. Object i is initially identified as crisp from category k, k = 1, ..., K, whenever M_{i1}(k) − M_{i2}(k) is greater than, or equal to, the bound in (B1):

$$M_{i1}(k) - M_{i2}(k) \ge \mathrm{CEIL}\left[\max\{J/(K+1),\; J(K-2)/(2(K-1))\}\right],$$

so that for these objects we initially set g_{ik} = 1 and g_{iℓ} = 0, ℓ ≠ k.
2. Use the observed classifications to estimate the matrix of classification probabilities, Λ, assuming the objects selected in Step 1 are crisp and that modal selections are the correct categories.
3. For the objects that are not initially identified as crisp, set the preliminary value of g_{ik} equal to the number of judges that classified object i to category k, divided by J.
4. Use the preliminary value of Λ, from Step 2, and the preliminary estimates of the g_{ik}, from Steps 1 and 3, as starting values for the PL Algorithm (described below).
The bound in Step 1 is used to avoid the identification of mixture objects as pure types. Still, some of the objects that are identified as crisp will be mixtures that have large values of M_{i1}(k) − M_{i2}(k), and some pure types will not be identified initially. Step 1 should be modified to incorporate prior information about (a) the minimum classification accuracy of judges, and (b) the minimum number of nearly crisp objects per category.
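A Python sketch from us of Steps 1-3, using the conventions of the earlier sketches (the function name and array layout are ours):

```python
import math
import numpy as np

def starting_values(x, K):
    """Preliminary G and Lambda from observed classifications x (I x J array)."""
    I, J = x.shape
    counts = np.stack([(x == k).sum(axis=1) for k in range(K)], axis=1)  # (I, K)
    bound = math.ceil(max(J / (K + 1), J * (K - 2) / (2 * (K - 1))))     # as in (B1)

    order = np.argsort(counts, axis=1)
    m1 = counts[np.arange(I), order[:, -1]]   # M_i1(k): largest agreement
    m2 = counts[np.arange(I), order[:, -2]]   # M_i2(k): second largest
    crisp = (m1 - m2) >= bound                # Step 1
    modal = order[:, -1]

    G0 = counts / J                           # Step 3 for the remaining objects
    G0[crisp] = np.eye(K)[modal[crisp]]       # crisp objects get g_ik = 1

    # Step 2: estimate lambda_{ljk} from the provisionally crisp objects.
    Lam0 = np.full((J, K, K), 1.0 / K)        # fallback if a type has no crisp objects
    for l in range(K):
        sel = crisp & (modal == l)
        if sel.any():
            for j in range(J):
                for k in range(K):
                    Lam0[j, l, k] = (x[sel, j] == k).mean()
    return G0, Lam0
```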
PL Algorithm
When the successive estimates of g_{ik} and λ_{kjℓ} converge, the following formulas maximize the conditional likelihood of (7), first with respect to G for fixed Λ (this is the G-step in (B2)) and then with respect to Λ for fixed G (the Λ-step of (B3)):

G-step:
$$\hat{g}_{ik} = \frac{1}{J}\sum_{j=1}^{J} g[i, k \mid j], \tag{B2}$$

where

$$g[i, k \mid j] = \frac{\hat{g}^{(0)}_{ik}\,\hat{\lambda}^{(0)}_{kjx(i,j)}}{\sum_{\ell=1}^{K}\hat{g}^{(0)}_{i\ell}\,\hat{\lambda}^{(0)}_{\ell jx(i,j)}}$$

(ĝ⁽⁰⁾_{ik} and λ̂⁽⁰⁾_{kjx(i,j)} represent estimates from the previous iteration) and x(i, j) is the category to which judge j classifies object i;
Λ-step:
$$\hat{\lambda}_{kj\ell} = \frac{\sum_{i=1}^{I} y_{ij\ell}\; g[i, k \mid j]}{\sum_{r=1}^{K}\sum_{i=1}^{I} y_{ijr}\; g[i, k \mid j]}, \tag{B3}$$
where y_{ijk} is the indicator for whether judge j classifies object i to category k, defined in (2). Varki (1996, pp. 136-139) shows how (B2) and (B3) each maximize the corresponding profiles of the conditional likelihood (7), when they converge. This is an iterative approximation to an approach justified by Richards (1961), although we are relying entirely on the conditional likelihood (7). Manton et al. (1994, p. 68) provide a similar approach to the conditional GoM model that is designed for latent categories (based on Woodbury & Clive, 1974), and Tolley and Manton (1992) show that when Λ̂ and Ĝ are selected to maximize the conditional likelihood (7), Λ̂ will be consistent (whenever P is estimable, and Λ and G are identifiable). Additional information about the joint asymptotic distribution of Λ̂ and the moments of F_g is provided by Manton et al. (1994, pp. 75-76).
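A compact sketch from us of one PL iteration, the G-step (B2) followed by the Λ-step (B3); convergence monitoring and the final constrained maximization of step (c) are omitted, and the array names are ours:

```python
import numpy as np

def pl_iteration(x, G, Lam):
    """One profile-likelihood iteration.

    x   : (I, J) observed choices, x[i, j] = x(i, j).
    G   : (I, K) current grade-of-membership estimates.
    Lam : (J, K, K) current estimates, Lam[j, l, k] = lambda_{ljk}.
    """
    I, J = x.shape
    K = G.shape[1]

    # g[i, k | j]: weight of pure type k for object i, given judge j's choice.
    w = np.empty((I, J, K))
    for j in range(J):
        num = G * Lam[j][:, x[:, j]].T            # (I, K): g_ik * lambda_{k j x(i,j)}
        w[:, j, :] = num / num.sum(axis=1, keepdims=True)

    G_new = w.mean(axis=1)                        # G-step, as in (B2)

    Y = np.eye(K)[x]                              # (I, J, K) indicators y_ijk of (2)
    Lam_new = np.empty_like(Lam)
    for j in range(J):
        num = w[:, j, :].T @ Y[:, j, :]           # (K, K): sum_i g[i,k|j] * y_ijl
        Lam_new[j] = num / num.sum(axis=1, keepdims=True)   # Lambda-step, (B3)
    return G_new, Lam_new
```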
References

Aptech Systems. (1995). Constrained maximum likelihood. Kent, WA: Author.
Batchelder, W.H., & Romney, A.K. (1986). The statistical analysis of a general Condorcet model for dichotomous choice situations. In B. Grofman & G. Owen (Eds.), Information pooling and group decision making (pp. 103-112). Greenwich, CT: JAI Press.
Batchelder, W.H., & Romney, A.K. (1988). Test theory without an answer key. Psychometrika, 53, 193-224.
Batchelder, W.H., & Romney, A.K. (1989). New results in test theory without an answer key. In E.E. Roskam (Ed.), Mathematical psychology in progress (pp. 229-248). Berlin, Heidelberg, New York: Springer-Verlag.
Berkman, L., Singer, B., & Manton, K.G. (1989). Black/white differences in health status and mortality among the elderly. Demography, 26, 661-678.
Blazer, D., Woodbury, M.A., Hughes, D., George, L.K., Manton, K.G., Bachar, J.R., & Fowler, N. (1989). A statistical analysis of the classification of depression in a mixed community and clinical sample. Journal of Affective Disorders, 16, 11-20.
Chavez, J.M., & Buriel, R. (1988). Mother-child interactions involving a child with epilepsy: A comparison of immigrant and native-born Mexican Americans. Journal of Pediatric Psychology, 13, 349-361.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37-46.
Cohen, J. (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70, 213-220.
Cooil, B., & Rust, R.T. (1995). General estimators for the reliability of qualitative data. Psychometrika, 60, 199-220.
Dillon, W.R., & Mulani, N. (1984). A probabilistic latent class model for assessing inter-judge reliability. Multivariate Behavioral Research, 19, 438-458.
Holland, J.L. (1985). Making vocational choices: A theory of vocational personalities and work environments. Englewood Cliffs, NJ: Prentice-Hall.
Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906.
Klauer, C.K., & Batchelder, W.H. (1996). Structural analysis of subjective categorical data. Psychometrika, 61, 199-240.
Manton, K.G., Woodbury, M.A., & Tolley, H.D. (1994). Statistical applications using fuzzy sets. New York, NY: John Wiley & Sons.
Perreault, W.D., Jr., & Leigh, L.E. (1989). Reliability of nominal data based on qualitative judgments. Journal of Marketing Research, 26, 135-148.
Richards, F.S.G. (1961). A method of maximum-likelihood estimation. Journal of the Royal Statistical Society, Series B, 23, 469-476.
Tolley, H.D., & Manton, K.G. (1992). Large sample properties of estimates of a discrete grade of membership model. Annals of the Institute of Statistical Mathematics, 44, 85-95.
Tsai, C.Y., & Denton, J.J. (1993). Reliability assessment of a classroom observation system. Journal of Classroom Interaction, 28, 23-32.
Varki, S. (1996). New strategies and methodologies in customer satisfaction. Unpublished doctoral dissertation, Vanderbilt University, Nashville, TN.
Varki, S., Cooil, B., & Rust, R.T. (2000). Modeling fuzzy data in qualitative marketing research. Journal of Marketing Research, 37, 480-489.
Vertrees, J., & Manton, K.G. (1986). A multivariate approach for classifying hospitals and computing blended payment rates. Medical Care, 24, 283-300.
Woodbury, M.A., & Clive, J. (1974). Clinical pure types as a fuzzy partition. Journal of Cybernetics, 4, 111-121.
Woodbury, M.A., & Manton, K.G. (1982). A new procedure for analysis of medical classification. Methods of Information in Medicine, 21, 210-220.
Woodbury, M.A., Manton, K.G., & Tolley, H.D. (1994). A general model for statistical analysis using fuzzy sets: Sufficient conditions for identifiability and statistical properties. Information Sciences, 1, 149-180.
Yale, L., & Gilly, M.C. (1988). Trends in advertising research: A look at the content of marketing-oriented journals from 1967 to 1985. Journal of Advertising, 17, 12-22.
Manuscript received 7 JUN 1999
Final version received 24 NOV 2002