Nonparametric Two-Sample Methods for Ranked-Set Sample Data Michael A. F

Transcription

Nonparametric Two-Sample Methods for Ranked-Set Sample Data Michael A. F
Nonparametric Two-Sample Methods for
Ranked-Set Sample Data
Michael A. F LIGNER and Steven N. M AC E ACHERN
A new collection of procedures is developed for the analysis of two-sample, ranked-set samples, providing an alternative to the Bohn–Wolfe
procedure. These procedures split the data based on the ranks in the ranked-set sample and lead to tests for the centers of distributions,
confidence intervals, and point estimators. The advantages of the new tests are that they require essentially no assumptions about the
mechanism by which rankings are imperfect, that they maintain their level whether rankings are perfect or imperfect, that they lead to
generalizations of the Bohn–Wolfe procedure that can be used to increase power in the case of perfect rankings, and that they allow one
to analyze both balanced and unbalanced ranked-set samples. A new class of imperfect ranking models is proposed, and the performance
of the procedure is investigated under these models. When rankings are random, a theorem is presented which characterizes efficient data
splits. Because random rankings are equivalent to iid samples, this theorem applies to a wide class of statistics and has implications for a
variety of computationally intensive methods.
KEY WORDS: Mann–Whitney; Perceptual error models; Pitman efficacy; Ranking models; Thurstonian model; Wilcoxon.
1. INTRODUCTION
There are many experimental settings where it is far cheaper
to recruit a unit to participate in the study and to make an informal measurement on it than it is to make the formal measurement that is traditionally analyzed with statistical methods.
Ranked-set sampling provides a design that allows one to exploit this differential cost of recruitment and measurement.
A basic description of a ranked-set sample drawn from a single population is as follows: k2 units are collected as iid draws
from the population. These k2 units are partitioned, at random,
into k sets, each of size k. In the first set of k units, the response
judged to be smallest is chosen for full measurement; in the second set, the response judged to be second smallest is chosen; in
the third set, the response judged to be third smallest is chosen; and so on, until, in the last set, the response judged to be
largest (kth smallest) is chosen. This process is repeated, say,
m times, to produce a (balanced) ranked-set sample with mk
fully measured units. The full measurements, along with the associated ranks, constitute the data from the ranked-set sample.
When sampling more than one population, independent rankedset samples are drawn from each population.
The ranked-set sampling design was first described by
McIntyre (1952), who proposed it for estimation of pasture
yields, where a trained eye can order a set of plots by yield
fairly accurately. Since then, an extensive literature on the sampling plan has been produced, with particular attention given to
development of novel ranked-set sample estimators, comparison of ranked-set sample estimators to those based on simple
random samples, and selection of the set size. Review articles
by Kaur, Patil, Sinha, and Taillie (1995) and Bohn (1996) described much of this work and showed some of the wide variety
of applications to which the technique has been put. We note the
particular influence of Stokes and Sager (1988), who exploited
the connection between stratified sampling and ranked-set sampling (the k ranking classes loosely correspond to k distinct
strata) to consider estimation of a distribution function while
relaxing all assumptions on the ranking process. The general
Michael A. Fligner is Professor Emeritus, Department of Statistics, The
Ohio State University, Columbus, OH 43210 (E-mail: [email protected]).
Steven N. MacEachern is Professor, Department of Statistics, The Ohio State
University, Columbus, OH 43210 (E-mail: [email protected]). This material is based on work supported by the National Science Foundation under
awards DMS-00-72526 and SES-0437251 and by the National Security Agency
under award MSPF-04G-109.
tenor of results is that a ranked-set sample provides more accurate inference than does an iid sample of the same size.
Bohn and Wolfe (1992, 1994) developed and investigated a
version of the Mann–Whitney–Wilcoxon (MW) rank sum test
appropriate for use with balanced ranked-set sample data. They
found that their test outperformed the MW test based on independent random samples from the two populations. However, their test relies on the assumption that the units in a set
are ranked perfectly. When rankings are imperfect, the level of
their test rises, perhaps substantially. In this article we develop
statistical inference for a version of the MW test for data collected from two populations with independent ranked-set samples. The proposed test is distribution free. It does not depend
on either perfect rankings or exact knowledge of the mechanism
that yields imperfect rankings. The procedure works for both
balanced and unbalanced ranked-set samples. For shift alternatives, it leads to confidence intervals and, through the Hodges–
Lehmann device, to an estimator of the difference in locations
of the two populations.
Properties of the proposed procedure are investigated through
asymptotic theory and simulation. For the simulation, a new
class of models is proposed for imperfect rankings. The proposed procedure, which relies on minimal assumptions about
the judgment ranking process, is shown to be competitive at
both extremes for the quality of rankings. When judgment rankings are perfect, the new procedure is competitive with the
BW procedure; when judgment rankings are random, the new
procedure is asymptotically equivalent to the MW test.
Section 2 establishes notation, describes the statistic on
which the new tests are based, and provides the main result for
the asymptotic equivalence to existing procedures under certain
settings. This section compares the new method to the traditional MW test and then compares our new procedure to the
BW procedures. Section 3 provides a further description of the
theoretical results, focusing on the consistency of the test statistic and the Pitman efficacy of the hypothesis test. Section 4
introduces a new class of models for imperfect rankings. Section 5 gives the results of a simulation. Section 6 applies the
new method to an experiment on spray deposits, and Section 7
concludes with a discussion.
1107
© 2006 American Statistical Association
Journal of the American Statistical Association
September 2006, Vol. 101, No. 475, Theory and Methods
DOI 10.1198/016214506000000410
1108
Journal of the American Statistical Association, September 2006
2. PRELIMINARY RESULTS
2.1 Balanced Ranked-Set Samples
To construct a ranked-set sample (RSS) from a single population, the experimenter must select k mutually independent
random samples, each of size k. In the rth random sample, the
experimenter measures the item judged to be the rth smallest
observation among the k, r = 1, . . . , k. This process is repeated
m times with each replication of the process known as a cycle. This is referred to as a balanced ranked-set sample because
the same number of observations is taken on each judgment order statistic. In the two-sample situation, a balanced ranked-set
sample is obtained from each population. The necessary notation follows.
For population 1, let X[r]i denote the observation in sample r
of cycle i judged to be the rth smallest. The ranked-set sample of size M = mk from population 1 is X[1]1 , . . . , X[1]m , . . . ,
X[k]1 , . . . , X[k]m , where F(x) denotes the underlying continuous
distribution of this population. In addition, X[r]1 , . . . , X[r]m is
a random sample with cdf F[r] (x). If the judgment ranking is
perfect, then F[r] (x) = F(r) (x), the distribution of the rth-order
statistic of a random sample of size k from F(x); if the judgment ranking is made at random, then F[r] (x) = F(x). Similarly, we let Y[1]1 , . . . , Y[1]n , . . . , Y[k]1 , . . . , Y[k]n denote an RSS
of size N = nk from a second population with continuous cdf
G(y) = F(y − ), with Y[r]1 , . . . , Y[r]n a random sample with
cdf G[r] (y). Note that the sample from the Y population is based
on n cycles, although we are assuming that the set size (number of items ranked) is the same in both ranked-set samples.
Although there are important differences between ranked-set
samples and stratified samples, we sometimes refer to F[r] (x)
and G[r] (y) as the distributions of the rth strata, r = 1, . . . , k.
Bohn and Wolfe (1992) proposed a distribution-free test of
H0 : = 0 versus H1 : > 0 under the assumption of perfect
ranking. Their test statistic is given by
BW =
k n k m
ψ Y[s]t − X[i] j
s=1 t=1 i=1 j=1
=
k k
s=1 i=1
Tsi ,
where Tsi = nt=1 m
j=1 ψ(Y[s]t − X[i] j ) and ψ(x) = 1 if x > 0
and ψ(x) = 0 otherwise. The null distribution of the BW statistic is distribution-free under the assumption of perfect ranking. Although all arrangements of the mk X’s and nk Y’s are
no longer equally likely under the null hypothesis, the assumption of perfect rankings enables us to compute the probabilities
of the various rankings. These probabilities do not depend on
the underlying distribution, and so the distribution-free property holds. Interestingly, the BW statistic is also distributionfree when the rankings are random as it then has the same
distribution as the MW statistic based on independent samples
of size mk and nk. Thus, the distribution of BW under random ranking is quite different from its distribution under perfect ranking, and the level of the test under random ranking is
greatly inflated. The intermediate case of imperfect ranking is
studied in Bohn and Wolfe (1994), where it was shown that the
effect of milder forms of imperfect ranking is also to inflate the
level of the BW test.
Our proposed test is based on the statistic
T=
k
Tii .
i=1
The null distribution of each Tii is that of the MW statistic based
on sample sizes m and n, because under the null hypothesis
F[r] (x) = G[r] (x), r = 1, . . . , k. This null distribution for each
Tii remains the same whether the judgment ranking is perfect,
imperfect, or random. In the case of imperfect rankings, we
make only the reasonable assumption that the mechanism for
imperfect rankings is the same for both populations under the
null hypothesis. Because the Tii are mutually independent, the
null distribution of T is that of the convolution of k independent MW statistics based on sample sizes m and n whether the
judgment ranking is perfect, imperfect, or random. An immediate advantage of the proposed statistic T over BW is that the
distribution-free property does not require perfect rankings.
An inspection of the formula for the proposed statistic T
shows that it is based on a total of kmn comparisons among the
X’s and Y’s, whereas the BW statistic is based on k2 mn comparisons, as is the standard MW statistic for samples of size mk
and nk. The considerable reduction in the number of comparisons suggests that in the case of perfect ranking there may be a
loss of information relative to the BW statistic. We might expect
an additional loss of information because our procedure is valid
under perfect, imperfect, or even random rankings. Fortunately,
in those cases where information is lost, the effect is minimal.
Our proposed test assumes that the data have been collected
as independent ranked-set samples. The judgment ordering can
be perfect as was assumed by Bohn and Wolfe, although this assumption was quite strong, particularly as the number of items
ranked increases. Imperfect judgment ordering occurs in various degrees with the most extreme case corresponding to complete randomness; that is, the item judged as the rth smallest is
equivalent to having chosen an item at random from the k observations in the set. The purpose of using ranked-set samples is to
gain information over simple random sampling. If we knew the
judgment ordering was going to be completely at random, there
would be little reason to collect data as a ranked-set sample, and
we could just collect simple random samples from each population and use the standard MW statistic for comparing two populations. However, we now show that the proposed statistic T is
asymptotically equivalent to the MW if the rankings are completely random, whereas it will be shown in the next section
that with more informative judgment orderings it will outperform the MW statistic based on comparable sample sizes. This
result for random rankings is somewhat surprising as our test is
based on kmn comparisons among the X’s and Y’s, whereas the
MW statistic is based on k2 mn comparisons.
Under random ranking, the joint distribution of the rankedset samples is the same as a random sample of M = mk X’s
with cdf F(x) and an independent random sample of N = nk
Y’s with cdf G(y). Now suppose we divide the X’s at random
into k groups of size m and the Y’s at random into k groups of
size n. Let MWi be the MW statistic computed on the ith group
Fligner and MacEachern: Nonparametric Two-Sample Methods
1109
of X’s and ith group of Y’s only, and let
RMW =
k
MWi .
i=1
We call RMW the k random-split Mann–Whitney statistic. Letting MW denote the full MW statistic computed on all M = mk
X’s and N = nk Y’s, we now show that the difference between
MW and RMW, suitably normalized, goes in probability to 0.
d
d
Because, under random ranking, T = RMW and MW = BW,
d
where = denotes equal in distribution, the theorem also shows
the asymptotic stochastic equivalence of the BW statistic and
the proposed statistic T under random ranking.
Theorem 2.1. Let X1 , X2 , . . . , Xkm and Y1 , Y2 , . . . , Ykn
denote independent random samples from populations with
continuous cdf’s F(x) and G(y), respectively. Let k be a fixed
constant, M = km, N = kn, and assume 0 < λ = limM,N→∞ (M/
(M + N)) < 1. Then
√
RMW
MW
−
M+N 2
kmn
k mn
converges in probability to 0 as M, N → ∞.
Proof. Dividing the X’s at random into k groups of size m,
we let Xr,1 , . . . , Xr,m denote the rth group of X’s, and dividing
the Y’s at random into k groups of size n, we let Yr,1 , . . . , Yr,n
denote
the rth group of Y’s, r = 1, . . . , k. Then, letting MWi =
n m
t=1
j=1 ψ(Yi,t − Xi,j ),
RMW =
k
MWi
i=1
and
MW =
k n k m
ψ(Ys,t − Xi,j ),
s=1 t=1 i=1 j=1
where we note that the sum in MW is over all pairs of
X’s and Y’s. Lehmann (1951) extended Hoeffding’s (1948)
U-statistic theorem to two-sample U-statistics (see also Randles
and Wolfe 1991). Using this result, it follows from the projection of the MW onto sums of iid random variables that
1 MW
1 =
F(Ys,t ) +
(1 − G(Xi,j ))
2
km
k mn kn
k
k
n
s=1 t=1
m
i=1 j=1
1
.
+ op √
M+N
(1)
Applying the projection to the individual MWi gives
1
MWi 1 =
F(Yi,t ) +
(1 − G(Xi,j ))
mn
n
m
n
m
t=1
j=1
+ op
1
,
√
M+N
i = 1, . . . , k.
Now, by combining (1) and (2), it follows immediately that
k
1
MW
RMW 1 MWi
.
=
= 2
+ op √
kmn
k
mn
k mn
M+N
i=1
(2)
Theorem 2.1 relies on the description of the MW statistic as
a nearly linear function of the data. If the statistic were exactly
linear, as is the difference in sample means, there would be no
loss of information in a k random split. That is, Y −X, the difference in means of the kn Y observations and the km X observations, can be recovered exactly as the average of the differences
in means of the k random groups. Although the MW statistic
is not linear in the X’s and Y’s, Theorem 2.1 follows from the
approximate linearity of the MW statistic. Approximate linearity is a common property of estimators and has been used to
show that many classes of statistics have limiting normal distributions. Consequently, a similar argument for the equivalence
of a k-random-split statistic and the corresponding full statistic applies to a large group of estimators and test statistics. In
Section 5, it will be shown via simulation that, even for small
to moderate sample sizes, the RMW statistic loses little information relative to the full MW statistic. For some statistics that
are more complex, particularly those for which computational
time is worse than O(M + N), use of one or even several k random splits can be preferable to processing the complete dataset.
However, this must be weighed against the fact that the results
will depend on the random splits used.
The next result is a comparison of the
exact variances of the
MW and RMW statistics. Letting γ = F dG, both RMW/kmn
and MW/k2 mn provide unbiased estimates of γ . Using standard results for the variance of U-statistics (Randles and Wolfe
1991), it follows that
var(RMW/kmn) k[(m − 1)ζ0,1 + (n − 1)ζ1,0 + ζ1,1 ]
=
→1
(km − 1)ζ0,1 + (kn − 1)ζ1,0 + ζ1,1
var(MW/k2 mn)
as m, n → ∞, where ζ0,1 = F 2 dG − γ 2 , ζ1,0 = (1 −
G)2 dF − γ 2 , and ζ1,1 = γ (1 − γ ). Under H0 , for k = 3 and
m = n = 10, the preceding expression for the ratio of variances
takes the value 1.033, suggesting the near equivalence of MW
and RMW, even for relatively small sample sizes.
We now consider the case of perfect rankings. In this situation, the null distribution of the BW statistic depends on the fact
that comparisons are being made between F[r] (x) = F(r) (x) and
G[s] (y) = G(s) (y) for r, s = 1, . . . , k. Because of this, the distinction between the BW statistic and the proposed statistic T
involves more than just a reduction in the number of comparisons. Specifically, the set of comparisons used in the calculation of T are intended to be those comparisons that are more
informative for detecting the alternative. Because, under the
shift model, G(r) (y) = F(r) (y − ), r = 1, . . . , k, we would expect comparisons within the strata determined by the different
judgment order statistics to be the most useful for detecting a
shift due to the reduction in variability within the strata. Surprisingly, dropping the less informative (noisier) comparisons
can actually increase power. Thus, it is possible in some cases
for T to have greater power than BW, despite the reduction in
the number of comparisons.
For a better understanding, consider the class of statistics
Wc =
k
i=1
Tii + c
Tsi
s=i
for some constant c. For any choice of c, under perfect rankings the null distribution of Wc is distribution-free. This is because the probabilities of the various orderings of the mk X’s
1110
Journal of the American Statistical Association, September 2006
the balanced case, neither the mi nor the ni can depend on i.
From a design standpoint, imbalance can be useful because
not all judgment order statistics contain the same information
regarding a difference in the populations. Ozturk and Wolfe
(2000) investigated the design issue under the assumption of
perfect rankings and under the restriction that the mi (ni ) are
either equal to some common value, m (n), or are 0. In this
section, we consider a more general lack of balance, where the
mi and ni , i = 1, . . . , k, are arbitrary. Our focus is on the development of appropriate testing procedures within this context.
The proposed procedure T can easily be adapted to unbalanced designs, namely, those that do not use the same sample
size for each of the judgment order statistics. When the design
is unbalanced, the question arises as to how the Tii should be
combined to form the final statistic. For testing H0 : = 0 versus H1 : > 0, a natural statistic would be of the form
Figure 1. Plots of Pitman ARE When Rankings Are Perfect and
k = 2. A value of c = 0 corresponds to T , whereas a value of 1 corresponds to BW. The highest line is for the uniform distribution, the middle
line is for the normal distribution, and the lowest line is for the t distribution with 3 df.
and nk Y’s can be computed under perfect ranking and Wc depends on the data only through these orderings. The class Wc includes both the BW statistic (c = 1) and the proposed statistic T
(c = 0). Although dropping the cross-stratum comparisons
(c = 0) entails some loss of information, for many distributions
the BW statistic tends to overweight these comparisons. When
k = 2, the Pitman asymptotic relative efficiency (ARE) of BW
relative to the MW statistic was shown by Bohn and Wolfe
(1992) to be 1.5 for any distribution under perfect rankings. Because BW is just the Mann–Whitney computed on the rankedset sample, this efficiency results from the improvement in RSS
over simple random sampling (SRS) for perfect ranking. The
Pitman ARE of T relative to the MW statistic depends on the
underlying distribution, even under perfect rankings. The exact
expression will be given in the next section. However, Figure 1
shows the Pitman ARE between Wc and the MW statistic as a
function of c for the t distribution with 3 degrees of freedom,
the normal distribution, and the uniform distribution.
From Figure 1, we see that neither BW (c = 1) nor T (c = 0)
is optimal for any of these three distributions under perfect
ranking. Although the optimal c depends on the underlying distribution, it appears that for these three distributions a choice
of c around .6 would produce a test with reasonably good properties. We also note that, even under perfect rankings, for k = 2
the loss of information when using T instead of BW is minimal
for the normal distribution and is larger for the heavier-tailed
t distribution, but that T performs better than BW for the uniform. Finally, the weights in the expression for Wc can be made
more general because the optimal statistic would not necessarily treat all of the Tii or Tsi , s = i, in an equivalent manner.
Although an exploration of this might be interesting for very
small values of k, once k reaches 4 or 5 it becomes of less practical interest, as the assumption of perfect ranking becomes less
tenable as k increases.
2.2 Unbalanced Ranked-Set Samples
Suppose that mi and ni are the number of X and Y observations, respectively, in the ith judgment class for i = 1, . . . , k. In
Tα =
k
i=1
αi
Tii
,
m i ni
with weights αi suitably chosen to reflect the imbalance. For
simplicity, the weights should not involve the underlying distribution. In addition, the weights should ensure that Tα coincides
with the proposed statistic T in the balanced case and that, under random rankings, the procedures based on Tα should not
suffer relative to the MW. The rationale for this last requirement
is the same as in the balanced case. Because each statistic Tα is
distribution-free without the assumption of perfect ranking, if
the procedure is equivalent to the MW under random ranking,
then we wish to gain information provided the judgment orderings are informative.
Now suppose that the null hypothesis is true and that the
rankings are random. Then, for each i = 1, . . . , k for which
mi ni > 0,
1
Tii
=
E
m i ni
2
and
1
1
1 m i + ni
Tii
= σi2 =
+
.
var
m i ni
12 mi ni
12 mi ni
Because the Tii are independent, standard results show that, under the null hypothesis and for random rankings, var(Tα ) is
minimized by choosing
σ −2
αi = k i −2 .
t=1 σt
(3)
by (3). Let
Define Tα˜ to be Tα with
the weights specified
λi = mi /(mi + ni ), M = ki=1 mi , and N = ki=1 ni . Provided
λi → λ, 0 < λ < 1 for i = 1, . . . , k as mi , ni → ∞, it follows
easily that
var(Tα˜ )
→ 1.
var(MW/mn)
(4)
Thus, the proportions, (mi + ni )/(M + N), of observations in
the different strata determined by the judgment order statistics are not important; only that the balance, λi , between the
X’s and Y’s be the same within each stratum. A consequence of
this, proved in Section 3, is that the Pitman ARE of Tα˜ relative
Fligner and MacEachern: Nonparametric Two-Sample Methods
to MW is 1 under random ranking, provided λi → λ, 0 < λ < 1
as mi , ni → ∞ for i = 1, . . . , k.
The result in expression (4) has some rather surprising implications for the RMW, because, under random ranking, the
joint distribution of the ranked-set samples is the same as under
simple random sampling. Suppose that, under simple random
sampling from the two populations, there are a total of 30 X’s
and 30 Y’s. The usual MW is then based on 30 × 30 = 900
comparisons. The balanced RMW with k = 3 would randomly
divide the X’s into three groups of size 10 and the Y’s into three
groups of size 10 and then compute three MW statistics using
10 X’s and 10 Y’s for each MW. Adding these MW statistics
gives an RMW that is based on only 300 comparisons in total.
We have already seen that the exact ratio of the variances in (4)
in this case is 1.033; that is, the two statistics are nearly equivalent. Now suppose we randomly divide the X’s into two groups
of size 10 and 20 and the Y’s into groups of size 20 and 10.
Then we compute two MW statistics, the first using the 10 X’s
and the 20 Y’s and the second using the 20 X’s and the 10 Y’s,
violating the equality of the λi . Combining these two MW statistics using the optimal weights αi provided previously yields a
statistic based on 400 total comparisons, 200 in each MW. Evaluation of the ratio of variances in (4) gives a value of 1.143 for
this setting. Thus, despite the 100 additional comparisons over
the balanced RMW, the imbalance in the ratio of X’s to Y’s in
the “strata” produces a net loss!
To better understand this, suppose that instead of comparing
the groups via the MW statistic we used the difference in the
sample means. Thus, we start with the overall estimate Y − X
based on 30 X’s and 30 Y’s. If we compute three differences in
means based on three groups with 10 X’s and 10 Y’s and then
optimally combine them, we just get back Y − X. Now suppose
we randomly divide the X’s into two groups of size 10 and 20
and the Y’s into groups of size 10 and 20. If we match the 20 X’s
with the 20 Y’s and the 10 X’s with the 10 Y’s, compute the
two differences in means, and optimally combine them we get
back Y − X. However, if we match the 20 X’s with the 10 Y’s
and the 10 X’s with the 20 Y’s, compute the two differences in
means, and optimally combine them, the resulting statistic is the
difference in a weighted average of the Y’s minus a weighted
average in the X’s. The result is an estimate that is more variable
than Y − X.
The relevance of the preceding example can be seen by considering the approximate linearity expression (1) provided in
the proof of Theorem 2.1. To a good approximation, the overall
MW behaves like the difference in two averages of iid random
variables. The same approximate linearity can be applied to the
individual Tii . However, when there is inequality in the λi , even
with the optimal weights, the statistic Tα˜ behaves like the difference in two weighted averages of iid random variables, where
the weights are unequal. The unequal weights in these averages
produce poorer performance as in the example using means.
Requiring the equality of the λi between strata ensures behavior of Tα˜ that is asymptotically equivalent to the full MW under
random ranking.
3. PERFECT RANKING:
LARGE–SAMPLE PROPERTIES
In this section large-sample properties of Tα˜ are obtained under the assumption of perfect ranking. This is necessary for
1111
comparison with the Bohn–Wolfe procedure, which is valid
only under this assumption. In the next section a new class of
models for imperfect rankings is introduced, and the behavior
of Tα˜ is studied for several members of this class.
The Pitman efficiency is evaluated for the shift model G(y) =
F(y − ). For perfect ranking, under the shift model, G(i) (y) =
F(i) (y − ) for each order statistic, i = 1, . . . , k. Additionally,
assume the balance criterion within strata, mi /(mi + ni ) → λ,
i = 1, . . . , k. Using the fact that the shift model holds for each
rank-specific distribution and the result in (7) concerning the
null variance of Tα˜ , it follows easily that the efficacy of Tα˜ under perfect ranking is
k
2
12λ(1 − λ)
αi f(i)
(x) dx.
(5)
i=1
If the rankings are random, replacing f(i) (x) by f (x), i =
1, . . . , k, in (5) gives the efficacy of Tα˜ under random ranking.
Because ki=1 αi = 1, this shows the Pitman ARE of Tα˜ is 1
with respect to MW under random ranking, as asserted in Section 2. Note that in situations where an imperfect-ranking model
ensures that the shift model holds within each of the strata, the
efficacy of Tα˜ under this imperfect-ranking model is obtained
by replacing f(i) (x) with f[i] (x), i = 1, . . . , k, in expression (5).
When F is stochastically smaller than G, we have each
F(i) is stochastically smaller than the corresponding G(i) , i =
1, . . . , k. Because Tα˜ is a convex combination of MW statistics
(MWi /mi ni ) based on F(i) and G(i) , i = 1, . . . , k, the consistency of Tα˜ under perfect ranking follows from the consistency
of each individual MW test for a stochastically ordered alternative. In Section 4 a class of models for imperfect rankings
is developed, which ensures that, under suitable conditions, if
F is stochastically smaller than G, we have each F[i] is stochastically smaller than the corresponding G[i] , i = 1, . . . , k. This
leads to some general results for the consistency of Tα˜ under
imperfect-ranking models.
Under perfect ranking and a balanced ranked-set sample as
in Bohn and Wolfe, the Pitman efficiencies of BW relative to
MW are 1.5, 2, and 2.5 for k = 2, k = 3, and k = 4, respectively
(see Bohn and Wolfe 1992). The increased efficiency of the BW
statistic relative to the MW statistic is due to the reduction in
the null variance of the statistic and, thus, does not depend on
the underlying distribution. Table 1 provides the Pitman efficiency of Tα˜ relative to the MW under perfect ranking for the
normal, t(3), and uniform distributions. The results in the table are for balanced ranked-set samples (αi = 1/k) and set sizes
k = 2, 3, and 4. Some of these situations are presented in the
simulation study of Section 5.
Both Tα˜ and MW have the same asymptotic null variance.
The increased efficiency of Tα˜ relative to MW is due to the
derivative of the mean function being larger because of the selected within-strata comparisons being made by the Tα˜ statistic.
Table 1. Pitman Efficiencies of Tα˜ Relative to MW
Distribution
k=2
k=3
k=4
Normal
t (3)
Uniform
1.48
1.40
1.78
1.95
1.80
2.56
2.42
2.21
3.35
1112
Journal of the American Statistical Association, September 2006
This is why the efficiency depends on the underlying distribution. By comparing the results of the table to the efficiencies of
BW relative to MW given previously, we see that BW is more
efficient than Tα˜ for the t(3), only slightly more efficient for
the normal, and less efficient for the uniform. These efficiency
results agree well with the simulation results presented in Section 5.
4. IMPERFECT–RANKING MODELS
Ranked-set sampling presumes that an investigator can examine a set of k units and assign ranks to them. Assignment
of ranks is subjective and, in typical contexts, may not correspond to a perfect ranking of the (unmeasured) Xi , i = 1, . . . , k.
The literature contains two main classes of imperfect-ranking
models. The first class is based on a ranking of the units’ perceived values, with the perceived values tied directly to the unmeasured, true values of the units. The second class (Bohn and
Wolfe 1994) is based on the selection of order statistics. In this
section we introduce a third class of models that overlaps the
first class, and we investigate properties of the new models. The
behavior of Tα˜ is studied under these models.
Dell and Clutter (1972) presented the first class of models, which have an antecedent special case dating at least to
Thurstone (1927). They posited a set of true values Xi , i =
1, . . . , k, for the units in a set. Along with its true value, each
unit has a perceived value, Zi = Xi + i , where the i are iid
draws from some distribution. This leads to the set of k vectors (Xi , Zi ). The units are ranked according to the Zi , with the
Xi traveling along as concomitants. Importantly, the Zi can be
considered a mental construct used only to generate the ranking that produces the X[i] . The models do not state that one can
record the Zi , and so these values are not accessible to build
regression models, perform diagnostics, and so forth. Thus, the
ordered set of pairs correspond to (X[i] , Z(i) ).
Dell and Clutter’s model has several nice features. Among
them, the probability of ranking Xi as larger than Xj depends
on the difference between the two true values of the units,
δ = Xi − Xj . The probability equals P(i − j > −δ). It is an
increasing function of δ. The model also yields stochastic ordering of the distributions of the X[i] , i = 1, . . . , k. However, in
some circumstances, the model seems inadequate, because the
probability that Xi is ranked larger than Xj depends only on δ,
but not on the value of Xi . Often, one imagines that it is more
difficult to correctly rank two units with large values and a separation of δ than it is to correctly rank two units with small
values and a separation of δ. See Presnell and Bohn (1999) for
a discussion on the two classes of ranking models.
An imperfect-ranking model based on perceived values
should satisfy the following properties:
4. The rank-specific distributions should be ordered. That is,
the distributions of X[i] should be stochastically increasing
in i, i = 1, . . . , k.
5. For the two-sample problem, if the distribution of Y is stochastically greater than the distribution of X, then Y[i] is
stochastically greater than X[i] for each i = 1, . . . , k.
To develop a family of models that have these properties, we
rely on notation that facilitates presentation of cross-population
comparisons. We write, as a model for actual value (T) and perceived value (U),
T|θ ∼ Fθ ,
(6)
U|T, θ ∼ GT ,
(7)
where the distribution GT does not involve the parameter θ . The
two populations are determined by values of the parameter θ .
We use θ1 ≤ θ2 to index the X and Y populations, respectively.
Note that the distribution of a perceived value, U, depends only
on the actual value of the unit, T. It depends on the population, indexed by θ , indirectly, through the influence of θ on T.
The new family of models supplements the hierarchical model
in (6) and (7) with the assumptions that the Fθ are monotone
likelihood ratio (MLR) in T and that the GT are MLR in U. We
refer to a model in this new family as an MLR-ranking model.
For Dell and Clutter’s model, the family of distributions GT is
a location family with location parameter T. We allow much
more general forms for the GT .
We recall several familiar facts about MLR families as they
relate to our problem. First, recall that the MLR property implies stochastic ordering of the distributions. That is, if U|T is
MLR in U, then the distributions GT are stochastically increasing in T. Thus, the assumption that U|T is MLR in U ensures
that property 1 holds. Second, if U|T is MLR in U, then, for
any strictly monotone g(·), U|g(T) is also MLR in U. Through
choice of g, this establishes property 2. Third, if U|T is MLR
in T, and if V is a coarsened version of U created through interval censoring, then V|T is MLR in V. Creating a coarsening
of U in this way allows us to produce models with property 3.
Fourth, if U|T is MLR in U, then T|U is MLR in T (see, e.g.,
Whitt 1979). This fact allows us to examine features of the distribution of the actual values given the perceived values and is
used in the theoretical development to follow.
Lemma 4.1. Assume that the MLR-ranking model holds.
Then the distributions U|θ are stochastically increasing in θ .
Proof. The MLR-ranking model implies that the distribution
of T|θ2 is stochastically greater than that of T|θ1 . The family
of distributions of U|T are stochastically increasing in T. Combining these two facts, we see that the distribution of U|θ2 is
stochastically greater than the distribution of U|θ1 .
The next theorem shows that property 4 holds.
1. Perceived values associated with larger actual values
should tend to be larger. Formally, the distribution for
Z|X = x should be stochastically increasing in x.
2. The family of models should allow for a trend in the probability of misranking units for fixed δ as a function of X1 .
3. The family of models should allow for ties in the rankings
(i.e., the model should allow for a ranker being unable to
rank some units).
Theorem 4.2. Assume that the MLR-ranking model holds
and that a ranked-set sample with set size k is chosen from some
population. Then the distribution of T[i] is stochastically smaller
than that of T[ j] for i < j.
Proof. Focus on a single set of size k. T[i] and T[ j] are
concomitants of U(i) and U( j) , respectively. Also, U(i) ≤ U( j) .
Consider now the distribution of T|U = ui or T|U = uj . The
Fligner and MacEachern: Nonparametric Two-Sample Methods
distribution of T|U is MLR in T and is, therefore, stochastically
increasing in U. This establishes the result.
The proof of property 5 follows the same line as the proof of
the previous theorem. To prove the result, we must handle two
additional difficulties—that we have two populations indexed
by different parameter values and that we wish to compare observations that have the same rank. To this end, we consider
a one-dimensional path through the space for (u, θ ). Define a
monotone path, P(s) = (u(s), θ (s)), to be a path for which the
functions u(s) and θ (s) are nondecreasing. The next lemma is
useful in the proof of the upcoming theorem.
Lemma 4.3. Assume that the MLR-ranking model holds.
Then each of the following distributions is MLR in T:
a. T|U = u, θ for fixed U = u.
b. T|U, θ for fixed θ .
c. T|P for any monotone path P.
Proof. Throughout the proof, take t1 < t2 and θ1 < θ2 .
a. With U = u fixed and using g(u|θ ) to represent the marginal density of g, we have
f (t1 |U = u, θ1 )
f (t1 |θ1 )g(u|t1 , θ1 )/g(u|θ1 )
=
f (t1 |U = u, θ2 )
f (t1 |θ2 )g(u|t1 , θ2 )/g(u|θ2 )
≥
f (t2 |θ1 )g(u|t2 , θ1 )/g(u|θ1 )
f (t2 |θ2 )g(u|t2 , θ2 )/g(u|θ2 )
=
f (t2 |U = u, θ1 )
.
f (t2 |U = u, θ2 )
This inequality establishes the MLR property.
b. With θ fixed, we have a joint distribution on T and U.
Because U|T is MLR in U, it follows that T|U is MLR in T.
c. This part follows from parts a and b. We work with an
alternate formulation of the likelihood ratios and take s2 > s1 :
f (t1 |U = u(s1 ), θ (s1 )) f (t1 |U = u(s1 ), θ (s2 ))
≥
f (t2 |U = u(s1 ), θ (s1 )) f (t2 |U = u(s1 ), θ (s2 ))
≥
f (t1 |U = u(s2 ), θ (s2 ))
.
f (t2 |U = u(s2 ), θ (s2 ))
Rewriting the inequality, we have
f (t1 |U = u(s1 ), θ (s1 )) f (t2 |U = u(s1 ), θ (s1 ))
≥
,
f (t1 |U = u(s2 ), θ (s2 )) f (t2 |U = u(s2 ), θ (s2 ))
which establishes part c.
Now we present the theorem that establishes property 5.
Theorem 4.4. Assume that the MLR-ranking model holds,
with the X population indexed by θ1 and the Y population
indexed by θ2 , where θ1 < θ2 . Assume that ranked-set samples with set size k are collected independently from the
X and Y populations. Then the distribution of X[i] is stochastically smaller than the distribution of Y[i] for each i = 1, . . . , k.
Proof. Lemma 4.1 establishes that the distribution U|θ1 is
stochastically smaller than is the distribution U|θ2 . To distinguish between the perceived values for observations from the
X and Y populations, we use Z and W, respectively, and so we
have that Z is stochastically smaller than W and, hence, that
Z(i) is stochastically smaller than W(i) for each i = 1, . . . , k.
1113
Fix i and imagine that Z(i) and W(i) are generated in a coupled
fashion. That is, a single uniform variate is used to generate
both perceived values with the inverse cdf method. Then the
realized values are ordered so that Z(i) ≤ W(i) .
Consider now the distribution of T[i] |U(i) = u, θ . For us,
the perceived X and Y values are z and w, respectively, with
z ≤ w. The conditional distribution for T[i] is the same as that
for T|U = u, θ . Define a monotone path, P, which has P(0) =
(z, θ1 ) and P(1) = (w, θ2 ). Such a path exists because z ≤ w and
θ1 < θ2 . Applying Lemma 4.3, we have that the distributions
along this path are MLR in T and, hence, that the distributions
for T are stochastically increasing. Noting that the distribution
for Y[i] |W(i) , θ2 corresponds to a larger value of s [where s indexes the path P(s)] than does the distribution of X[i] |Z(i) , θ1 establishes the result.
This has established stochastic ordering of the distributions
for each stratum. The Mann–Whitney test is known to be consistent against stochastically ordered distributions, and so each
within-stratum test is consistent. The next corollary establishes
that the proposed test is consistent by combining tests across
the (finite) number of strata.
Corollary 4.5. Assume that the MLR-ranking model holds,
with the X population indexed by θ1 and the Y population indexed by θ2 , where θ1 > θ2 . Further assume that ranked-set
samples with set size k are collected independently from the
X and Y populations and that, for some i ∈ {1, . . . , k}, both
mi and ni tend to ∞. Then the test based on Tα˜ is consistent
for a test of H0 : θ1 = θ2 against the alternative HA : θ1 > θ2 .
We note that milder properties may result in models that do
not yield X[i] stochastically smaller than Y[i] for each i. For example, it is easy to construct a counterexample where X is stochastically smaller than Y, the model for U|T remains MLR
in U, but where, for some i, X[i] is not stochastically smaller
than Y[i] .
Example 1: Additive perceptual errors. The classic example
of perceptual error has been used by Thurstone and by Dell and
Clutter:
T|θ ∼ N(θ, τ 2 ),
U|T ∼ N(T, σ 2 ).
At each stage of the model, the normal distributions, indexed
by their mean, form an MLR family. Dell and Clutter’s models
have been used with a variety of families replacing the normal
model for the true values. The normal distribution for the perceptual values can also be replaced with a different distribution
having mean T.
Example 2: Nonadditive perceptions 1. At times, we must
step outside the mode of additive perceptual errors, taking us
beyond Dell and Clutter’s model. One method of accomplishing this is to use a link function. An alternative version of
Thurstone’s model suggests that in some instances comparisons
of positive quantities, such as distances and areas, are made on
the basis of the relative size of the objects. This suggests the
following model, which exhibits property 2:
T|θ ∼ lognormal(θ, τ 2 ),
U|T ∼ N(log(T), σ 2 ).
1114
Journal of the American Statistical Association, September 2006
Property 1 ensures that the family of lognormal distributions
(with fixed τ 2 ) is MLR in T. The link function log(·) adjusts the
true values so that two values separated by δ are closer together
for perceptual purposes if the true values are large than if they
are small. A reviewer has suggested (and we agree) that this
qualitative effect should hold for judging the relative sizes of
books selected from a library shelf. Replacing the log link function with an arbitrary monotonic transformation preserves the
MLR property for U|T and, hence, preserves the MLR model.
Example 3: Nonadditive perceptions 2. Models for nonadditive perceptual error can be obtained in many ways. As an
example, we write a model for perceptions, which is parameterized by T:
T|θ ∼ Weibull(θ, β),
U|T ∼ gamma(cT, 1),
T θ /β
follows an exponential distribution with mean 1.
where
For the gamma distribution, cT is the shape parameter. The parameter c governs the quality of the rankings, with larger c indicating better rankings. The gamma(cT, 1) perceptual portion
of the model can be replaced by a gamma(α, cT) distribution.
Example 4: Tied ranks 1. Empirically, we have observed
(MacEachern, Stasny, and Wolfe 2004) that rankers feel that
they are sometimes unable to make a judgment about a pair of
items. When the true values lie in an interval, we may write the
model
T|θ ∼ beta Mθ, M(1 − θ ) ,
U|T ∼ binomial(n, T).
The parameter n in the binomial distribution determines the
likelihood of a tie in the rankings. This yields property 3, a feature that the additive perceptual errors of Dell and Clutter’s
model cannot accommodate.
Example 5: Tied ranks 2. A second approach to ties is to
begin with a continuous model for the perceptual values and
then coarsen the perceptual space. For example, we may write
T|θ ∼ N(θ, τ 2 ),
V|T ∼ N(T, σ 2 ),
U|V ∼ round(V).
Because the model for V is MLR, so is the model for U. The
models produced in this fashion include categorical models for
a response, such as those associated with ordinal regression,
where a set of cut points defines the categories.
5. FINITE–SAMPLE COMPARISON OF
THE PROCEDURES
In this section we compare the performances of the Bohn–
Wolfe (BW) and proposed (T) procedures based on ranked-set
samples and the Mann–Whitney–Wilcoxon (MW) and randomsplit Mann–Whitney–Wilcoxon (RMW) procedures based on
simple random samples. The comparison is based on a simulation study where we investigated a variety of set sizes and
underlying distributions and where we considered a variety of
quality of rankings, ranging from perfect to random. The simulation study investigated far more cases than are presented
here. We have selected cases for presentation that demonstrate
a range of behaviors for the tests.
The reported simulations consist of four models for rankings
with various set sizes k and total sample sizes M and N. The
first three models are additive perceptual-error models
Zi = Xi + i ,
i = 1, . . . , k,
as described in Example 1 of Section 4. The perceptual errors i
are normally distributed, with the true values Xi having a t distribution with 3 degrees of freedom, the normal distribution, or
the uniform distribution. These correspond to tests of H0 : θ = 0
versus HA : θ = 0. The correlation between Zi and Xi determines
the information content of the rankings. We have used models
with ρ = 1 for perfect ranking and ρ = .9 and .7 for imperfect
rankings. Once the correlation drops below .7, we have found
that there is little information contained in the rankings. Under
perfect ranking, the resulting models correspond to the three
models for which the Pitman efficiency of the proposed test Tα
relative to BW was computed in Section 3. The fourth model
considered is the Weibull distribution with gamma perceptions
as described in Example 3 of Section 4, where the test is for
H0 : β = 1 versus HA : β = 1. The parameter c governs the quality of the rankings. This provides an example of an asymmetric
population distribution, where the distributions do not differ by
a shift parameter but instead are stochastically ordered.
The simulation itself relied on variance reduction techniques,
using a common stream of random numbers to investigate all
distributions, shifts, and ρ’s with a common set size and total
sample size. The same stream of random numbers was used to
generate the ranked-set samples and the simple random samples. For the MW, RMW, and T tests, randomized tests were
used, so that the level would be exactly .05. The BW procedure
declares a cutoff for significance based on the asymptotic normal distribution of the test statistic under perfect rankings, but
because the level is not maintained when the ranking is imperfect, no further attempt was made to control the level of this
procedure. Each power curve is based on 10,000 replicates.
The left panels of Figure 2 contain the results for the additive
perceptual-error model when the distribution of the true values
is standard normal, the set size is 3, and the sample sizes are
M = N = 36. The right panels contain the results for the additive perceptual-error model when the distribution of the true
values is uniform, the set size is 4, and the sample sizes are
M = N = 48. Each of the six panels displays four simulated
power curves. The lowest two curves in each panel are for the
MW and RMW procedures, with the RMW showing slightly
less power than the MW procedure. When rankings are perfect,
the BW curve is a little higher than the T curve for the normal
distribution, whereas the situation is reversed for the uniform
distribution. This observation agrees with the Pitman efficiency
results. The BW test makes no adjustment for the imperfections
in rankings, and so its level rises as the rankings worsen. Although the BW test provides greater power to detect θ > 0, this
is attributable to its higher level. In contrast to this behavior,
the T test’s power curve drops toward that of the RMW procedure, based on a simple random sample. The advantage of the
new ranked-set sample test over the procedures based on simple
random samples is substantial when ρ = .9 and still noticeable
when ρ = .7. The level of the BW test has risen to .110 for
Fligner and MacEachern: Nonparametric Two-Sample Methods
1115
Figure 2. Plots of Power Curves for Four Tests. The left panels are for M = N = 36 and k = 3. The population distribution is normal. The right
panels are for M = N = 48 and k = 4. The population distribution is uniform. In all plots, the broken line is T , the highest solid line is BW, and the
two low, solid, nearly indistinguishable lines are MW and RMW.
the lower left panel and to .140 for the lower right panel. For
random rankings, the levels rise to .165 and .215, respectively.
Figure 3 contains results for set size 2 and sample sizes
M = N = 24. The left panels are for the additive perceptualerror model when the distribution of the true values is a t with
3 degrees of freedom, and the right panels are for the Weibull
distribution with gamma perceptions. When rankings are perfect, the BW curve is slightly higher than the T curve for the
t model and is almost identical to T for the Weibull model. For
imperfect rankings, the situations are quite similar to those of
Figure 2. The level of the BW test rises to .095 and .099 for the
lower left and right panels, respectively. Under random rankings, the level of the BW test rises to .111 for both the normal model and the Weibull model. The patterns that we see in
Figures 2 and 3 held throughout the other situations that were
simulated.
Following are the main conclusions reached from the simulation:
• The ranked-set sampling procedures are preferable to the
simple random-sampling procedures when the quality of
the rankings is moderate to high. With low-quality rankings, the advantage of a ranked-set sample design is moderate, perhaps nearly disappearing.
• The RMW procedure is a near equivalent of the MW
procedure. The more extensive simulations show that the
small difference apparent in the figures shrinks as the
within-strata sample sizes grow. When these sample sizes
are 16 (the next smallest within-strata sample size we investigated), the difference is small.
• Under perfect rankings, the BW procedure and the T procedure are nearly equal. The T procedure is sometimes
a little better than the BW procedure, sometimes nearly
equal, and sometimes a little worse. Which procedure is
better depends on the underlying distribution. The Pitman
AREs provide a good match to the patterns that we see for
finite sample sizes.
• When rankings are imperfect, the actual level of the BW
procedure rises. It may be considerably higher than the
nominal level. The new procedure maintains its level and
provides additional power compared to procedures based
on a simple random sample. Importantly, our procedure
1116
Journal of the American Statistical Association, September 2006
Figure 3. Plots of Power Curves for Four Tests. All panels are for M = N = 24 and k = 2. The left panels have a population distribution that is
t with 3 df. The right panels are for the Weibull model. In all plots, the broken line is T , the highest solid line is BW, and the two low, solid, nearly
indistinguishable lines are MW and RMW.
does not rely on knowledge of the form of the imperfectranking model or on the specific parameters that determine
the imperfect-ranking model.
6. EXAMPLE
Investigators at Horticulture Research International and the
University of Kent conducted a pilot study in 1997 to evaluate
the effectiveness of ranked-set sampling. The pilot study (see
Murray, Ridout, and Cross 2000) focused on an experiment to
compare the coverage of spray deposits on the leaves of apple
trees under two different sprayer settings. Two plots of trees
were sprayed with a fluorescent water-soluble tracer at 2% concentration. The first plot was sprayed at a high volume with a
coarse nozzle on the sprayer to produce large droplet sizes (the
coarse treatment), and the second was sprayed at a low volume
with a fine nozzle on the sprayer to produce small droplet sizes
(the fine treatment). The variable of interest (%Cover) is the
percentage of the leaf surface covered by the spray. A second,
related variable (Deposit) quantified the amount of spray on the
leaf surface. Deposit was measured by washing the leaf surface
with 5 mL of water and measuring the relative concentration of
the tracer. We will restrict our attention to %Cover.
One hundred and twenty-five leaves were sampled from each
plot, and each sample was randomly divided into 25 sets of
size 5. For each set of size 5, an observer ranked the leaves
according to the perceived coverage based on the visual appearance of the leaf surface under ultraviolet light. The actual value
of the variable %Cover was measured using an image analysis
system (Optimax V). Because the purpose of the experiment
was to investigate the efficacy of RSS as a technique, %Cover
was measured on all 125 leaves on each treatment.
Measuring %Cover on all leaves allows us to investigate two
important issues. The first is evaluation of the quality of the
observer’s ranking, which can be done because we can determine the true ranking in each set of 5. The second is to simulate
many ranked-set samples of size 25 and many simple random
samples of size 25, all drawn from the same dataset. This gives
us a better comparison of the competing ranked-set sampling
procedures as well as enabling us to compare them to simple
random-sampling procedures. To draw an RSS from the data,
we randomly select five sets in which the observation judged to
Fligner and MacEachern: Nonparametric Two-Sample Methods
be the minimum is chosen, a different five sets in which the observation judged to be the second smallest is chosen, and so on.
The resulting sample is a balanced ranked-set sample. Such a
sample is chosen independently for each of the two treatments.
To select an SRS, a single observation is selected at random
from each of the 25 sets for each treatment. This method of selecting an SRS matches the conditional division of leaves into
sets (which was accomplished at random) for the RSS and results in more equitable comparisons.
We first consider the quality of the rankings. In the simulation study, the quality of the rankings was measured using the
correlation between perceived and actual values. Because the
perceived values are unobservable, we use the following idea
to link the quality of rankings in the example to the simulation
study. For the 25 sets of rankings made by the observer in the
coarse treatment, the average value of the Kendall tau distance
(Randles and Wolfe 1991) between the observer’s rankings and
the true rankings was 1.48, and for the fine treatment the average tau distance was 1.40. The distributions of the tau distances
were quite similar as well. We simulated the Dell–Clutter model
with a normal population distribution, normal errors, and a correlation of .9 to obtain 10,000 sets of five perceived and true
observations. After ranking the sets of perceived and true values, we found the average tau distance between the rankings to
be 1.43, in fairly good agreement with the example (a correlation of .7 gave an average tau distance of 2.52). This gives us a
sense that the observer’s rankings in the example, while not perfect, were of fairly high quality. The correlation of .9 relates the
quality of the rankings in the example, in a broad sense, to the
simulation study results. We return to this point in the following
discussion.
Although an examination of the data shows considerable
overlap between the two treatments, the coarse treatment
(mean = .221, sd = .162) had a clearly higher %Cover than
the fine treatment (mean = .153, sd = .115). In fact, an MW
test based on all 125 observations on each treatment yields a
two-tailed p value of .001. We simulated 10,000 ranked-set
samples of size 25 from each treatment, and for each sample
we computed both the BW statistic and the T statistic. In addition, 10,000 simple random samples of size 25 were obtained
from each treatment, and the MW statistic and the RMW statistic were computed. For two-sided level-.05 tests, rejection rates
for the BW test, the T test, the MW test, and the RMW test
were .7500, .5445, .2852, and .2783, respectively. Because the
coarse treatment apparently concentrates on larger values than
does the fine treatment, a large rejection rate points to effectiveness of the test in detecting this difference.
There is close agreement between the MW and the RMW
with the MW performing just slightly better, as was shown in
the simulation. The ranked-set sampling design, in conjunction
with the T statistic, shows a clear superiority to the MW procedure. As expected, the BW test rejects substantially more often
than the proposed procedure because the level of the BW test
is not maintained under imperfect ranking. Our previous discussion of the quality of rankings suggests that in this example a Dell–Clutter model with a correlation of .9 may provide a
rough approximation to the quality of rankings for this observer.
We simulated 10,000 replications under the null hypothesis of
identical populations with the Dell–Clutter model, a correlation of .9, and a set size of 5. The simulated level of the BW
1117
procedure was .1086, whereas with the same data, the level of
the proposed procedure was .0510. This potential doubling of
the level can easily explain the increased number of rejections
for the BW test. If the proposed procedure were performed at
the .10 level instead of the .05 level on the 10,000 ranked-set
samples simulated from the example, the rejection rate would
increase from .5445 to .7009. Importantly, although the quality
of rankings in this example is quite good, the results suggest
that the rise in level due to even small departures from perfect
ranking can make the BW procedure inappropriate as a tool for
comparing two samples.
7. CONCLUSIONS
From a theoretical perspective, a ranked-set sample design
has been shown, in many contexts, to be a preferable design
to a simple random-sample design. The additional information,
collected in the form of ranks associated with the observations,
allows one to construct more powerful tests and more accurate estimators. The drawback of many theoretical results is that
they rely on very strong assumptions, such as perfect rankings,
balanced samples, exact symmetry of a population distribution,
or rankings that, while imperfect, can be described with perfect
knowledge. The central problem for ranked-set sampling is the
development of procedures that are robust to the sort of violation of assumptions likely to be seen in practice. It is this issue
that we have addressed in this article.
One practical issue that our procedures address is how to analyze an unbalanced ranked-set sample. Whether the imbalance
arises by design, through loss of units to full measurement,
or through judgment poststratification (see MacEachern et al.
2004), the procedures we discuss allow us to make a full range
of inferences about the shift parameter, . The second practical
issue that we address is ensuring that hypothesis tests have an
actual level equal to their nominal level. The level of the BW
test rises to unacceptable levels when ranking is imperfect.
We have developed a new class of models for imperfect ranking that are based on perceptions about the actual units in a set.
This began with an intuitively reasonable set of properties that
models based on perceived values should satisfy. A set of assumptions were then developed, which ensure that these properties hold. A series of examples illustrates the scope of the
models.
[Received May 2003. Revised December 2005.]
REFERENCES
Bohn, L. L. (1996), “A Review of Nonparametric Ranked-Set Sampling
Methodology,” Communications in Statistics, Part A—Theory and Methods,
25, 2675–2685.
Bohn, L. L., and Wolfe, D. A. (1992), “Nonparametric Two-Sample Procedures
for Ranked-Set Samples Data,” Journal of the American Statistical Association, 87, 552–561.
(1994), “The Effect of Imperfect Judgment Rankings on Properties of
Procedures Based on the Ranked-Set Samples Analog of the Mann–Whitney–
Wilcoxon Statistic,” Journal of the American Statistical Association, 89,
168–176.
Dell, T. R., and Clutter, J. L. (1972), “Ranked Set Sampling Theory With Order
Statistics Background,” Biometrics, 28, 545–555.
Hoeffding, W. (1948), “A Class of Statistics With Asymptotically Normal Distribution,” The Annals of Mathematical Statistics, 19, 293–315.
Kaur, A., Patil, G. P., Sinha, A. K., and Taillie, C. (1995), “Ranked Set Sampling: An Annotated Bibliography,” Environmental and Ecological Statistics,
2, 25–54.
Lehmann, E. L. (1951), “Consistency and Unbiasedness of Certain Nonparametric Tests,” The Annals of Mathematical Statistics, 22, 165–179.
1118
MacEachern, S. N., Stasny, E. A., and Wolfe, D. A. (2004), “Judgment PostStratification With Imprecise Rankings,” Biometrics, 60, 207–215.
McIntyre, G. A. (1952), “A Method of Unbiased Selective Sampling Using
Ranked Sets,” Australian Journal of Agricultural Research, 3, 385–390.
Murray, R. A., Ridout, M. S., and Cross, J. V. (2000), “The Use of Ranked
Set Sampling in Spray Deposit Assessment,” Aspects of Applied Biology, 57,
141–146.
Ozturk, O., and Wolfe, D. A. (2000), “An Improved Ranked Set Two-Sample
Mann–Whitney–Wilcoxon Test,” The Canadian Journal of Statistics, 28,
123–135.
Journal of the American Statistical Association, September 2006
Presnell, B., and Bohn, L. L. (1999), “U-Statistics and Imperfect Ranking in
Ranked Set Sampling,” Journal of Nonparametric Statistics, 10, 111–126.
Randles, R. H., and Wolfe, D. A. (1991), Introduction to the Theory of Nonparametric Statistics (2nd ed.), New York: Wiley.
Stokes, S. L., and Sager, T. W. (1988), “Characterization of a Ranked-Set Sample With Application to Estimating Distribution Functions,” Journal of the
American Statistical Association, 83, 374–381.
Thurstone, L. L. (1927), “A Law of Comparative Judgement,” Psychological
Reviews, 34, 273–286.
Whitt, W. (1979), “A Note on the Influence of the Sample on the Posterior
Distribution,” Journal of the American Statistical Association, 74, 424–426.