Higher Order Testlet Response Models for Hierarchical Latent Traits
Transcription
Higher Order Testlet Response Models for Hierarchical Latent Traits
Article Higher Order Testlet Response Models for Hierarchical Latent Traits and Testlet-Based Items Educational and Psychological Measurement 73(3) 491–511 Ó The Author(s) 2012 Reprints and permissions: sagepub.com/journalsPermissions.nav DOI: 10.1177/0013164412454431 epm.sagepub.com Hung-Yu Huang1 and Wen-Chung Wang2 Abstract Both testlet design and hierarchical latent traits are fairly common in educational and psychological measurements. This study aimed to develop a new class of higher order testlet response models that consider both local item dependence within testlets and a hierarchy of latent traits. Due to high dimensionality, the authors adopted the Bayesian approach implemented in the WinBUGS freeware for parameter estimation. A series of simulations were conducted to evaluate parameter recovery, consequences of model misspecification, and effectiveness of model–data fit statistics. Results show that the parameters of the new models can be recovered well. Ignoring the testlet effect led to a biased estimation of item parameters, underestimation of factor loadings, and overestimation of test reliability for the first-order latent traits. The Bayesian deviance information criterion and the posterior predictive model checking were helpful for model comparison and model–data fit assessment. Two empirical examples of ability tests and nonability tests are given. Keywords item response theory, testlet response theory, higher order models, hierarchical latent trait, Bayesian methods Testlet design, in which a set of items share a common stimulus (e.g., a reading comprehension passage or a figure), has been widely used in educational and 1 Taipei Municipal University of Education, Taipei, Taiwan The Hong Kong Institute of Education, Hong Kong, Hong Kong SAR 2 Corresponding Author: Wen-Chung Wang, 10 Lo Ping Road, Tai Po, New Territories, Hong Kong SAR. Email: [email protected] Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 492 Educational and Psychological Measurement 73(3) psychological tests. Many test developers find testlet design attractive because of its efficiency in item writing and administration. However, testlet design poses a challenge to standard item response theory (IRT) models, because items within a testlet are connected with the same stimulus such that the usual assumption of local item independence in standard IRT models may not hold. Fitting standard IRT models to such data leads to biased parameter estimation and overestimation of test reliability (Thissen, Steinberg, & Mooney, 1989; Wainer & Lukhele, 1997; Wainer & Thissen, 1996; Wainer & Wang, 2000). Various testlet response models have been developed to account for the local dependence among items within a testlet by adding a set of random-effect parameters to standard IRT models, one for each testlet (Bradlow, Wainer, & Wang, 1999; Wainer, Bradlow, & Du, 2000; Wainer, Bradlow, & Wang, 2007; W.-C. Wang & Wilson, 2005). In the human sciences, latent traits may be hierarchical. For example, a hierarchical structure orders mental abilities with g (general ability) at the top of the hierarchy and other broad abilities (e.g., mathematical, spatial-mechanical, and verbal reasoning) lower down. Specifically, the Wechsler Adult Intelligence Scale measures a total of three orders of latent traits. At the first order, there are 13 subtests, each measuring a specific ability. At the second order, these 13 specific abilities are grouped into four broad abilities: verbal comprehension, perceptual reasoning, working memory, and processing speed. At the third order, these four broad abilities constitute a general mental ability (Ryan & Schnakenberg-Ott, 2003). The SAT Reasoning Test consists of three sections: critical reading, mathematics, and writing. Each section receives a score on a scale of 200 to 800. Total scores are calculated by adding up the scores of the three sections. One can treat the tests as measuring three first-order latent traits (critical reading, mathematics, and writing) and one second-order latent trait (scholastic aptitude). Similarly, the Graduate Record Examinations consist of three sections of verbal, quantitative, and analytical writing. One can treat the examinations as measuring three first-order latent traits (verbal reasoning, quantitative reasoning, and analytical writing skills) and one second-order latent trait (scholastic aptitude). Recently, several higher order IRT models for hierarchical latent traits have been developed. These include the cognitive diagnosis models by de la Torre and Douglas (2004), multidimensional models with a hierarchical structure by Sheng and Wikle (2008), and higher order IRT models by de la Torre and Song (2009), de la Torre and Hong (2010), and Huang, Wang, and Chen (2010). In the IRT literature, testlet response models and higher order IRT models were developed separately. Since hierarchical latent traits may be measured by testlet-based items, it is of great value to develop a class of IRT models that consider both testlet design and hierarchical structure in latent traits, which is the main purpose of this study. This article is organized as follows. First, higher order IRT models are briefly introduced and a new class of higher order testlet response models is formulated. Second, parameter estimation and model–data fit assessment for the new class of models are described. Third, simulations are conducted to assess parameter recovery, effects of model misspecification, and model–data fit indices using the WinBUGS computer program Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 493 (Spiegelhalter, Thomas, & Best, 2003), and the results are summarized. Fourth, two empirical examples of achievement and personality assessments are presented to demonstrate the applications of the new class of higher order testlet response models. Fifth, conclusions are drawn and suggestions for future study are provided. Higher Order IRT Models Assume the hierarchy of latent traits is in two orders. Each of the first-order latent traits is measured by a set of unidimensional items, and these first-order latent traits are governed by the same second-order latent trait. Let u(2) n be the second-order latent be the vth first-order latent trait for person n. The relationship between trait and u(1) nv these two orders of latent traits is assumed to be (2) (1) u(1) nv = bv un + env , ð1Þ where e(1) nv is assumed to be normally distributed with mean zero and independent of other es and us, and bv is the regression weight (factor loading) between the secondorder latent trait and the vth first-order latent trait. More orders of latent traits (e.g., the third order) can be readily generalized. For a dichotomous item measuring the first-order latent trait, the item response function can be the one-, two-, or three-parameter logistic model (Birnbaum, 1968; Rasch, 1960) or other IRT functions. In the three-parameter logistic model, the probability of a correct response to item i in test v for person n is Pni1v = piv + (1 piv ) 3 exp½aiv (u(1) nv div ) , 1 + exp½aiv (u(1) nv div ) ð2Þ where aiv is the slope (discrimination) parameter, div is the location (difficulty) parameter, and piv is the asymptotic (pseudo-guessing) parameter of item i in test v. Combining Equations 1 and 2 leads to Pni1v = piv + (1 piv ) 3 (1) exp½aiv (bv u(2) n div + env ) (1) 1 + exp½aiv (bv u(2) n div + env ) : ð3Þ If the item response function follows the two-parameter logistic model, piv in Equation 3 becomes zero. If it follows the one-parameter logistic model, piv and aiv in Equation 3 become zero and one, respectively. Equation 3 is referred to as the three-parameter higher order IRT model (3P-HIRT). If the item response function is the two-parameter logistic model, then it is called the 2P-HIRT. If the item response function is the one-parameter logistic model, then it is called the 1P-HIRT (de la Torre & Hong, 2010; Huang et al., 2010). For a polytomous item measuring the first-order latent trait, the item response function can be the generalized partial credit model (Muraki, 1992), the partial credit model (Masters, 1982), the rating scale model (Andrich, 1978), or the graded Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 494 Educational and Psychological Measurement 73(3) response model (Samejima, 1969). Take the generalized partial credit model as an example. The log odds are defined as Pnijv log ð4Þ = aiv (u(1) nv div tijv ), Pni(j1)v where Pnijv and Pni(j 2 1)v are the probabilities of scoring j and j 2 1 on item i in test v for person n, respectively; div is the overall difficulty of item i in test v; tijv is the jth threshold parameter of item i in test v; and the others are defined as above. Combining Equations 1 and 4 leads to Pnijv (1) log ð5Þ = aiv (bv u(2) n div tijv + env ): Pni(j1)v If the item response function follows the partial credit model, then aiv = 1. If it follows the rating scale model, then aiv = 1 and tijv = tjv. If it follows the graded response model, then cumulative logit should be used as the link function and a common set of step parameters are used across items. The corresponding higher order IRT model can be easily formed (Huang et al., 2010). When the item response function is the generalized partial credit model, the corresponding higher order IRT model is denoted as the GP-HIRT. Likewise, when item response function is the partial credit model, rating scale model, or graded response model, the corresponding higher order IRT model is denoted as the PC-HIRT, RS-HIRT, or GR-HIRT, respectively. Higher Order Testlet Response Models Assume that testlet-based items are used and the testlet effect exists in each testlet. To account for the testlet effect, one can add a random-effect parameter to Equation 2 (dichotomous item) or Equation 4 (polytomous item): Pni1v = piv + (1 piv ) 3 exp½aiv (u(1) nv div + gnd(i)v ) 1 + exp½aiv (u(1) nv div + gnd(i)v ) , Pnijv log = aiv (u(1) nv div tijv + gnd(i)v ), Pni(j1)v ð6Þ ð7Þ where gnd(i)v is an additional latent trait to account for local dependence among items within testlet d of test v and is assumed to be normally distributed and independent of u and other gs (Wainer et al., 2007; W.-C. Wang & Wilson, 2005). The g variance depicts the magnitude of the testlet effect: the larger the variance, the stronger the testlet effect is. When the latent traits are hierarchical, one can combine Equations 1 and 6 or Equations 1 and 7: Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang Pni1v = piv + (1 piv ) 3 495 (1) exp½aiv (bv u(2) n div + env + gnd(i)v ) (1) 1 + exp½aiv (bv u(2) n div + env + gnd(i)v ) Pnijv (1) = aiv (bv u(2) log n div tijv + env + gnd(i)v ): Pni(j1)v , ð8Þ ð9Þ Equation 8 is referred to as the three-parameter higher order testlet response model (3P-HTM). If piv = 0 for all iv, then the 3P-HTM is reduced to the two-parameter higher order testlet response model (2P-HTM). If piv = 0 and aiv = 1 for all iv, then the 3P-HTM is reduced to the one-parameter higher order testlet response model (1PHTM). Equation 9 is referred to as the generalized partial credit higher order testlet response model (GP-HTM), because its item response function is the generalized partial credit. Likewise, if the item response function is the partial credit model, rating scale model, or graded response model, then it is denoted as the PC-HTM, RS-HTM, or GS-HTM, respectively. In the new class of HTM, the item response function is not limited to a specific function. Users are allowed to assign any function to any item. The HTM can accommodate the situation where a test consists of both independent items and testlet-based items and both dichotomous items and polytomous items. In addition, it is feasible to assign different item response functions to different items, even in the same test. For example, some dichotomous items may follow the one-parameter logistic model, and others follow the two-parameter or three-parameter logistic model. Some polytomous items follow the rating scale model, and others follow the generalized partial credit model. Users are also allowed to create their own customized item response functions. For model identification, one can set u(2) ; N (0, 1) and u(1) nv ; N (0, 1) so that bv can be interpreted as the correlation between u(2) and u(1) v , as in traditional linear factor analysis. When the item response function follows the family of Rasch models, the common slope parameter across all items in the same test can be freely estimated. MCMC Estimation and Bayesian Model–Data Fit Assessment The new class of higher order testlet response models often involves many randomeffect parameters––u, g, and e. Although it is possible to develop likelihood-based estimation methods for higher order testlet response models, these methods become computationally burdensome because of high-dimensional numerical integration and may fail to converge. Bayesian estimation with the Markov chain Monte Carlo (MCMC) methods, which have been widely used in complicated IRT models (Bolt & Lall, 2003; Bradlow et al., 1999; Fox & Glas, 2003; Kang & Cohen, 2007; Klein Entink, Fox, & van der Linden, 2009; Wainer et al., 2000, 2007), were thus used to estimate the parameters. In Bayesian estimation, specifications of a statistical model, prior distributions of model parameters, and observed data are required to produce a joint posterior distribution for the model parameters. MCMC methods provide an Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 496 Educational and Psychological Measurement 73(3) alternative and simple way to simulate the joint posterior distribution of the unknown quantities and obtain simulation-based estimates of posterior parameters of interest. The Metropolis-Hastings and the Gibbs sampling algorithms (Gelfand & Smith, 1990; Geman & Geman, 1984) are two major procedures to construct the Markov chain through transition density to target density. In this study, we used the WinBUGS freeware (Spiegelhalter et al., 2003) to simulate Markov chains, and the mean of the joint posterior distribution was treated as the parameter estimate. WinBUGS is flexible enough to allow users to specify different models for different items within or between tests. For example, a test may consist of both dichotomous items and polytomous items, in which some dichotomous items follow the 1PLM, some dichotomous items follow the 3PLM, some polytomous items follow the partial credit model, and some polytomous items follow the rating scale model. A test may consist of both independent items (no testlet effects) and testlet-based items, exclusively independent items, or exclusively testlet-based items. All these combinations can be easily formulated in WinBUGS. Posterior predictive model checking (PPMC; Gelman, Meng, & Stern, 1996) is a Bayesian model–data fit checking technique to assess the plausibility of posterior predictive replicated data against observed data, which has the advantage of a strong theoretical basis and an intuitively appealing simplicity applied with numerical evidence. Let y be the observed data and yrep be the replicated data. The posterior predictive density function of yrep given model parameter v can be given by ð P(yrep jy) = P(yrep jv)P(vjy)dv, ð10Þ A test statistic T is chosen to detect the systematic discrepancy between the observed data and the replicated data. The posterior predictive p value is defined as follows to summarize the comparison between the two test statistics over a large number of iterations: ð rep p [ P½T (y ) T (y)jy = P(yrep jy)dyrep : ð11Þ T (yrep )T (y) An extreme p value (close to 0 or 1) indicates a poor model–data fit. According to the nature of the higher order testlet response model (HTM), four statistics were chosen to assess model-data fit in this study. The first one is the Bayesian chi-square test (Sinharay, Johnson, & Stern, 2006), which evaluates the overall model–data fit: X X ½yni E(yni jv)2 n i Var(yni jv) , ð12Þ where subscript n is for person and subscript i is for item. The second statistic evaluates factor structure. It compares the reproduced correlation matrix with the original Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 497 correlation matrix. If the difference between the two matrices is huge, then the factors do not account for a great deal of the variance in the original correlation matrix, and thus, the model–data fit is poor. The third and fourth statistics measure the association among item pairs within a testlet for dichotomous items and polytomous items, respectively. For dichotomous items, we compute the odds ratios between item pairs within a testlet to depict uncontrolled dependence between items. Let Nkk# be the number of persons scoring k on the first item and k# on the second item. The odds ratio (OR) between an item pair is given by OR = N11 N00 : N10 N01 ð13Þ For polytomous items, the Spearman rank correlation between item pairs within a testlet is computed to assess between-item association. If any systematic discrepancy is found between the odds ratios (or the rank correlations) obtained from the observed data and the replicated data, misfit due to the unexplained variance from item dependence is detected (Sinharay et al., 2006). To reduce the computational burden and concentrate on testlet effects, only the odd ratios and the correlations for item pairs within a testlet are computed, and those across testlets are not computed. For model comparison, one can use the Bayesian deviance information criterion (DIC), which simultaneously takes into account model fit and model complexity (Spiegelhalter, Best, Carlin, & van der Linde, 2002): DIC = D + PD, ð14Þ where D is the posterior expectation of the deviance and PD is defined as the difference between the posterior mean of the deviance and the deviance at the posterior mean of the parameters. Method Simulation Design The simulations comprised two major parts: dichotomous items and polytomous items. Each part had three first-order latent traits and one second-order latent trait. Each test measuring the first-order latent trait consisted of either 20 dichotomous testlet-based items or 20 four-point polytomous testlet-based items. All the g variances (testlet effects) were set at one, suggesting a rather large testlet effect. The number of testlets in the 20-item test was either two (each testlet had 10 items—a long testlet) or four (each testlet had 5 items—a short testlet). Although the g variances were all equal across testlets, their effects were stronger in the long testlets. For dichotomous items, the item response functions in each test followed the oneparameter, two-parameter, or three-parameter testlet response model. For polytomous items, the item response functions in each test followed the generalized partial credit, Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 498 Educational and Psychological Measurement 73(3) partial credit, rating scale, or graded response testlet response model. The factor loadings (the correlations between the second-order latent trait and the three first-order latent traits) were set at .9, .8, and .7, respectively. In the dichotomous items, the difficulty parameters were sampled from U(22, 2), the slope parameters were sampled from N(1, 0.25) and truncated above 0, and the pseudo-guessing parameters were sampled from U(0, 0.3). In the polytomous items, the step difficulty parameters were generated from U(22.5, 2.5), and the location parameters were generated from U(21.5, 1.5) with 21, 0, and 1 as the category threshold parameters, respectively. The specifications of item parameters were in line with those commonly found in practice. The item parameters were generated independently, consistent with the IRT literature. If necessary, multivariate normal distribution can be assumed for the item parameters (van der Linden, Klein Entink, & Fox, 2010). A total of 5,000 persons were generated when the item response function of dichotomous items followed the three-parameter model. However, a total of 2,000 persons were generated when the item response function followed the other models because a large sample size is required for estimating the pseudo-guessing parameters. All the variances of latent traits were set at one. The simulated data sets were analyzed in two ways: First, the data-generating model was also the analysis model (i.e., the HTM was fit to HTM data). Second, the standard higher order IRT model (HIRT) was fit to HTM data. We wrote a Matlab computer program to generate item responses. Ten replications were made under each condition, mainly because each replication could take several days of computer time, and there were many conditions. In our experience, 10 replications appeared to be sufficient to gain reliable inferences because the sampling variation across replications appeared to be rather small. Other studies that used Bayesian estimation for complicated IRT models also used 10 replications or fewer (Bolt & Lall, 2003; Cao & Stokes, 2008; Klein Entink et al., 2009; Li, Bolt, & Fu, 2006; van der Linden et al., 2010). Analysis The freeware WinBUGS 1.4 with MCMC methods was used to estimate the parameters. We specified a normal prior with mean 0 and variance 4 for all the location and threshold parameters, a lognormal prior with mean 0 and variance 1 for the slope parameters, a beta prior with both hyperparameters equal to 1 for the pseudo-guessing parameters, a normal prior with mean 0.5 and variance 10 accommodating a truncated distribution between 0 and 1 for the factor loadings, and a gamma prior with both hyperparameters equal to 0.01 for the inverse of g variances (testlet effects). A total of 15,000 iterations were obtained with the first 5,000 iterations as the burn in. These settings were consistent with the literature when WinBUGS was used to fit IRT models. After monitoring the convergence diagnostic according to the multivariate potential scale reduction factor (Brooks & Gelman, 1998) with three parallel chains for Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 499 the first simulated data set across all analysis models, we found that the chain lengths were sufficient to reach stationarity for all structural parameters. In addition, the last 200 MCMC samples for latent trait estimates were used to compute test reliability (precision of person measures). For each estimator, we computed the bias and the root mean square error (RMSE): Bias(E(zr )) = R X (EAP(^zr ) z)=10, ð15Þ r=1 vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u R uX 2 RMSE(E(zr )) = t (EAP(^zr ) z) =10, ð16Þ r=1 where z was the generating value and EAP(^zr ) was the expected a posterior estimate in replication r. It was expected that (a) the parameter recovery would be satisfactory when the HTM was fit to HTM data because the data-generating model and the analysis model were identical; (b) the parameter estimation would be biased and the test reliability would be overestimated when the HIRT was fit to HTM data because the testlet effect existed but was not considered; (c) the longer the testlet, the worse the estimation of parameters and test reliability would be when the testlet effect was not considered; and (d) the DIC would favor the data-generating HTM rather than the corresponding HIRT, and the PPMC would show a good model–data fit for the generating HTM and a poor fit for the corresponding HIRT. Results Parameter Recovery of Dichotomous Items Due to space constraints, the bias and RMSE for individual parameters are not reported; instead, their means and standard deviations are provided. Table 1 summarizes the bias values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The bias values yielded by the HTM appeared to be very close to zero and were consistent with those found in the literature when data-generating dichotomous testlet response models were fit (Bradlow et al., 1999; Wainer et al., 2000; W.-C. Wang & Wilson, 2005). In comparison, the 1PHTM yielded bias values slightly better than those by the 2P-HTM, which were slightly better than those by the 3P-HTM. The difference in bias values between long and short testlets was small. When the testlet effect was ignored and the HIRT was fit to HTM data, the bias values were many times more extreme than those when the HTM was fit. The underestimation for the three factor loadings by the HIRT was the most significant, with bias values ranging from 20.193 to 20.043. Table 2 summarizes the RMSE values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The HTM yielded small Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 500 Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 — — 9 8 213 — — 24 15 23 46 68 — — 10 26 12 5 27 60 50 28 28 211 34 27 5 4 44 Short 218 43 Long 11 21 Short 2-Parameter 40 30 1 6 3 16 42 19 50 51 154 Long 25 43 6 29 0 29 53 62 72 103 186 Short 3-Parameter — — 2178 2158 2149 — — 97 5 9 132 Long — — 270 269 243 — — 237 2 30 158 Short 1-Parameter — — 2167 2137 2128 — — 90 101 28 212 Long — — 276 267 275 — — 254 79 26 113 Short 2-Parameter HIRT Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model; — = not applicable. Difficulty Mean 29 SD 19 Slope Mean 220 SD 10 Pseudo-guessing Mean — SD — Factor loading 22 b1 19 b2 21 b3 Testlet variance Mean 5 SD 19 Long 1-Parameter HTM Table 1. Bias of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data. — — 2193 2147 2121 46 123 223 264 186 496 Long — — 277 286 266 45 91 6 151 169 356 Short 3-Parameter 501 Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 110 43 81 31 — — 34 28 41 175 61 23 5 — — 29 24 26 118 26 Long 63 15 Short 167 55 18 22 28 — — 90 34 110 47 Short 2-Parameter 122 41 31 27 24 57 32 114 57 209 137 Long 129 35 17 21 17 70 42 179 110 253 159 Short 3-Parameter — — 179 159 153 — — 101 6 133 53 Long — — 73 72 46 — — 43 4 156 66 Short 1-Parameter — — 169 139 131 — — 141 71 207 105 Long — — 78 69 80 — — 106 59 141 62 Short 2-Parameter HIRT Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model; — = not applicable. Difficulty Mean 64 SD 17 Slope Mean 41 SD 5 Pseudo-guessing Mean — SD — Factor loading 19 b1 27 b2 39 b3 Testlet variance Mean 81 SD 18 Long 1-Parameter HTM — — 194 149 123 116 87 332 196 446 346 Long — — 78 88 68 99 67 185 86 367 255 Short 3-Parameter Table 2. Root Mean Square Error of Dichotomous Items (Multiplied by 1,000) When the HTM and HIRT Were Fit to HTM Data. 502 Educational and Psychological Measurement 73(3) RMSE values, especially for the 1P-HTM. In contrast, when the testlet effect was ignored and the HIRT was fit, the RMSE values became many times larger than those in the HTM, especially for the factor loadings. A factorial analysis of variance was conducted to evaluate the effects of the three independent variables: (a) method (HIRT or HTM), (b) testlet length (short or long), and (c) model (1P, 2P, or 3P). When the RMSE of difficulty parameters was the dependent variable, the partial h2 was 17% for the main effect of model, 9% for the main effect of method, and less than 2% for the other effects. When the RMSE of discrimination parameters was the dependent variable, the partial h2 was 10% for the main effect of method and less than 2% for the other effects. When the RMSE of pseudo-guessing parameters was the dependent variable, the partial h2 was 8% for the main effect of method and less than 1% for the other effects. When the RMSE of factor loadings was the dependent variable, the partial h2 was 81% for the main effect of method, 52% for the main effect of testlet length, 43% for the interaction effect for method and testlet length, and less than 3% for the other effects. In short, method had a large impact on the parameter recovery, which demonstrated that ignoring the testlet effects by fitting the HIRT to HTM data yielded poor parameter estimation. Parameter Recovery of Polytomous Items Table 3 summarizes the bias values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. Similar to those found in dichotomous items, the bias values yielded by the HTM were very close to zero and were consistent with those found in the literature when data-generating polytomous testlet response models were fit (W.-C. Wang & Wilson, 2005; X. Wang, Bradlow, & Wainer, 2002). In comparison, the Rasch family models (partial credit and rating scale models) had a slightly smaller bias than the two-parameter models (generalized partial credit and graded response models). The difference in bias values between long and short testlets was small. When the testlet effect was ignored and the HIRT was fit to HTM data, most bias values were many times more extreme than those when the HTM was fit. The three factor loadings were substantially underestimated. Table 4 summarizes the RMSE values when the data-generating HTM (left-hand side) and the HIRT (right-hand side) were fit to HTM data. The results were consistent with those found in dichotomous items (Table 2). The HTM yielded small RMSE values, whereas the HIRT yielded RMSE values many times larger than those in the HTM. According to the factorial analysis of variance, when the RMSE of location parameters was the dependent variable, the partial h2 was 43% for the main effect of method and less than 3% for the other effects. When the RMSE of discrimination parameters was the dependent variable, the partial h2 was 5% for the main effect of method and less than 1% for the other effects. When the RMSE of factor loadings was the dependent variable, the partial h2 was 73% for the main effect of method, 48% for the main effect of testlet length, 42% for the interaction effect for method and testlet length, 16% for the interaction effect of method and model, 10% Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 503 Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 23 57 227 25 25 12 7 67 60 225 3 27 4 22 26 22 Long 1 22 Short 43 29 0 23 21 216 18 8 42 Short Generalized Partial Credit 29 15 29 0 6 212 3 22 15 Long 3 15 1 28 210 27 6 3 12 Short Rating Scale 45 23 24 29 22 213 17 25 22 Long 27 27 26 214 6 213 25 211 21 Short Graded Response — — 2124 2118 2127 260 7 214 315 Long — — 287 279 276 2235 3 213 321 Short Partial Credit — — 2186 2140 2119 256 151 220 326 Long — — 287 277 269 2231 149 0 329 Short Generalized Partial Credit — — 2161 2153 2122 261 3 10 329 — — 277 293 284 2238 7 19 324 Short Rating Scale Long HIRT Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model; — = not applicable. Location Mean 212 SD 29 Slope Mean 224 SD 11 Factor loading 25 b1 8 b2 213 b3 Testlet variance Mean 22 SD 24 Long Partial Credit HTM Table 3. Bias of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Was Fit to HTM Data. — — 2182 2160 2131 258 160 18 323 Long — — 291 290 258 2243 157 25 315 Short Graded Response 504 Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 114 42 63 21 21 26 35 144 31 79 23 32 5 22 16 27 75 21 110 21 14 24 30 61 22 110 43 60 11 42 27 25 19 2 43 10 Long 61 7 20 21 28 17 0 44 9 Short Rating Scale 87 19 25 29 29 50 19 56 16 Long 77 17 24 31 23 61 26 55 15 Short Graded Response — — 132 126 132 64 4 293 155 Long — — 87 82 78 236 3 281 160 Short Partial Credit — — 187 142 122 151 86 302 160 Long — — 89 80 75 238 144 307 167 Short Generalized Partial Credit HIRT Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model; — = not applicable. Location Mean 82 SD 23 Slope Mean 37 SD 5 Factor loading 29 b1 36 b2 31 b3 Testlet variance Mean 60 SD 6 Short Long Long Short Generalized Partial Credit Partial Credit HTM — — 165 155 124 63 3 285 168 Long — — 78 93 87 238 7 286 159 Short Rating Scale Table 4. Root Mean Square Error of Polytomous Items (Multiplied by 1,000) When the HTM and HIRT Was Fit to HTM Data. — — 183 161 133 158 91 280 169 Long — — 93 93 61 249 154 279 153 Short Graded Response Huang and Wang 505 for the main effect of model, and less than 2% for the other effects. Apparently, method played a major role in parameter recovery, and ignoring testlet effect yielded poor parameter estimation. Test Reliability The mean test reliabilities for dichotomous items and polytomous items across 10 replications are shown in Table 5. The test reliability obtained from the HTM was treated as a gold standard to which the test reliability obtained from the HIRT was compared, because the HTM was the data-generating model. It was found that fitting the HIRT to the HTM data (both dichotomous and polytomous items) substantially overestimated the test reliability for the first-order latent traits, because the testlet effect was mistakenly treated as a true variance. However, the test reliability of the second-order latent trait appeared to be less affected, which might be because the overestimation in the test reliability of the first-order latent traits and the underestimation of the factor loadings cancelled out such that the test reliability of the secondorder latent trait was estimated appropriately in the HIRT. Posterior Predictive Model Checking When the Bayesian chi-square and the reproduced correlation were used to evaluate model–data fit using the PPMC, as Table 6 shows, fitting the HTM to HTM data consistently yielded a good model–data fit (slightly overconservative); however, fitting the HIRT to the HTM data yielded a poor fit only for the one-parameter models. In other words, these two statistics appeared not to be sensitive enough to model misspecification in the multiparameter models. With respect to the odds ratio statistic for dichotomous items, the mean number of misfits across 120 item pairs was computed across 10 replications. According to this statistic, fitting the HTM yielded a very small number of misfits, whereas fitting the HIRT yielded a very large number of misfits, especially for the one-parameter models. In addition, the longer the testlets were, the more powerful the odds ratio statistic. With respect to the rank correlation for polytomous items, the mean number of misfits across 120 item pairs was computed. Like the odds ratio statistic for dichotomous items, the rank correlation performed fairly well in distinguishing the true and wrong models. In addition to the PPMC, the Bayesian DIC was computed to compare the HTM and HIRT. It was found that the HTM always had a smaller DIC value than the corresponding HIRT. Due to a small number of replications, the above PPMC results should be interpreted with caution. Two Empirical Examples Example 1: Ability Tests With Dichotomous Items In Taiwan, junior high school graduates who wish to enter senior high schools have to take the Basic Competence Tests for Junior High School Students, which consists Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 506 Educational and Psychological Measurement 73(3) Table 5. Mean Test Reliability for Dichotomous and Polytomous Items Across Replications. Long Dichotomous items 1-Parameter Test 1 Test 2 Test 3 2nd order 2-Parameter Test 1 Test 2 Test 3 2nd order 3-Parameter Test 1 Test 2 Test 3 2nd order Polytomous items Partial credit Test 1 Test 2 Test 3 2nd order Generalized partial credit Test 1 Test 2 Test 3 2nd order Rating scale Test 1 Test 2 Test 3 2nd order Graded response Test 1 Test 2 Test 3 2nd order Short HTM HIRT HTM HIRT .67 .65 .63 .66 .81 .81 .81 .65 .72 .71 .70 .71 .79 .79 .79 .71 .66 .66 .62 .66 .80 .83 .80 .66 .71 .72 .68 .70 .77 .80 .77 .70 .62 .62 .59 .62 .76 .78 .73 .61 .67 .67 .63 .67 .72 .75 .70 .66 .71 .70 .68 .70 .92 .92 .92 .69 .79 .78 .77 .75 .90 .90 .90 .76 .71 .70 .69 .70 .91 .92 .91 .69 .78 .79 .76 .75 .89 .90 .88 .75 .72 .70 .69 .70 .92 .92 .92 .70 .79 .78 .77 .76 .90 .90 .90 .77 .70 .70 .68 .69 .91 .92 .91 .68 .78 .78 .77 .75 .89 .90 .89 .74 Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model. All the standard deviations of test reliabilities were between .01 and .02. of five subjects: language, mathematics, English, social sciences, and nature sciences. In the 2006 tests, there were 48 items in language, 33 items in mathematics, 45 items in English, 63 items in social sciences, and 58 items in nature sciences; all were in Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 507 Table 6. Frequencies of Misfit in Posterior Predictive Model Checking in 10 Replications. Long 1-Parameter Bayesian chi-square Reproduced correlation Odds ratio (mean) 2-Parameter Bayesian chi-square Reproduced correlation Odds ratio (mean) 3-Parameter Bayesian chi-square Reproduced correlation Odds ratio (mean) Partial credit Bayesian chi-square Reproduced correlation Rank correlation (mean) Generalized partial credit Bayesian chi-square Reproduced correlation Rank correlation (mean) Rating scale Bayesian chi-square Reproduced correlation Rank correlation (mean) Graded response Bayesian chi-square Reproduced correlation Rank correlation (mean) Short HTM HIRT HTM HIRT 0 0 0.34 8 8 9.38 0 0 0.42 0 0 9.06 0 0 0.00 0 0 8.71 0 0 0.07 0 0 6.03 0 0 0.02 0 0 8.73 0 0 0.06 0 0 5.85 0 0 0.15 0 1 10.00 0 0 0.21 10 10 9.70 0 0 0.00 0 0 9.68 0 0 0.01 0 0 9.23 0 0 0.14 0 2 10.00 0 0 0.31 10 10 9.70 0 0 0.00 0 0 9.68 0 0 0.00 0 0 9.34 Note: HTM = higher order testlet response model; IRT = item response theory; HIRT = higher order IRT model. multiple-choice format. Every test consisted of both independent items and testletbased items. Altogether, there were 23 testlets in the five tests. Each of the five tests was treated as measuring a first-order latent trait. The five first-order latent traits were treated as governed by the same second-order latent trait, referred to as ‘‘academic ability.’’ A total of 5,000 examinees (2,639 boys and 2,361 girls) were randomly sampled from more than 300,000 examinees. The HTM and HIRT with item response functions following the 1PLM, 2PLM, and 3PLM were fit to the data, resulting in a total of six models. WinBUGS was used for parameter estimation. According to the Bayesian DIC, the 3P-HTM had the best fit. According to the PPMC with the Bayesian chi-square, reproduced correlation, and odds ratio statistics, the 3P-HTM also had a good fit. The estimates were between Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 508 Educational and Psychological Measurement 73(3) 0.60 and 10.71 (M = 2.60) for the slope parameters, between 22.72 and 1.53 (M = 20.19) for the difficulty parameters, and between 0.02 and 0.47 (M = 0.21) for the pseudo-guessing parameters. The factor loadings were .95, .90, .94, .96, and .98, and the test reliabilities were .95, .94, .94, .96, and .96, for language, English, mathematics, social sciences, and nature sciences, respectively. The test reliability for the second-order latent trait was .97. The high factor loadings were not surprising, because these five tests were designed to measure basic competence for junior high school students. The standard deviations of g for the 23 testlets were between 0.10 and 0.52 (M = 0.26). Compared to the unit standard deviation of the first-order latent traits (i.e., set at 1 for model identification), the random testlet effects ranged from trivial to moderate. When the testlet effects were not considered by fitting the 3P-HIRT to the data, the resulting factor loadings and test reliabilities were very similar to those found in the 3P-HTM, mainly because the testlet effects were rather small. Example 2: Nonability Tests With Polytomous Items Three tests were used to measure pathological Internet use in Taiwan (Lin & Liu, 2003), including the Internet Hostility Questionnaire (31 four-point items), the Chinese Internet Addiction Scale (26 four-point items), and the Questionnaire of Cognitive Distortions of Pathological Internet Use (16 four-point items). A single total score was reported for each test: the higher the total score, the more serious the Internet hostility, Internet addiction, or cognitive distortion. Each of the three tests was treated as measuring a first-order latent trait. The three first-order latent traits were treated as governed by a second-order latent trait, referred to as ‘‘pathological Internet use.’’ Each test consisted of several ‘‘subtests,’’ which were treated as testlets, because after partitioning out the target latent trait (u), the specific latent trait (g) left for each subtest was considered unimportant (nuisance). There were 15 testlets altogether in the three tests. A total of 987 Taiwanese Internet users (college students—625 males and 362 females) responded to the three tests. The HTM and HIRT with item response functions following the PCM, GPCM, RSM, and GRM were fit to the data, resulting in a total of eight models. WinBUGS was used. According to the Bayesian DIC, the GP-HTM had the best fit. According to the PPMC with the Bayesian chi-square, reproduced correlation, and rank correlation, the GP-HTM also had a good fit. The estimates were between 0.10 and 6.77 (M = 1.59) for the slope parameters and between 28.88 and 6.49 (M = 0.44) for the location parameters. The factor loadings were .68, .61, and .83, and the test reliabilities were .82, .89, and .81 for Internet addiction, Internet hostility, and cognitive distortions, respectively. The test reliability for the second-order latent trait was .71. It appeared that the factor loading of cognitive distortions contributed more to pathological Internet use than Internet hostility and Internet addiction. The standard deviations of g for the 15 testlets were between 0.40 and 3.23 (M = 1.41). Compared to the unit standard deviation of the first-order latent traits, most of the testlet effects were very large. When the large testlet effects were not considered by fitting the GPC-HIRT to Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 509 the data, the factor loadings were .60, .61, and .80, and the test reliabilities were .91, .94, and .88, for Internet addiction, Internet hostility, and cognitive distortions, respectively. The test reliability for the second-order latent trait was .70. Apparently, ignoring the testlet effects led to an underestimation of factor loadings and an overestimation of test reliabilities for the first-order latent traits, which is consistent with results found in the simulations. Discussion and Conclusion The new class of HTM was developed for tests measuring hierarchical latent traits with testlet-based items. The HTM is very flexible because the item response function is not limited to any specific function. Different items can have different item response functions, even in the same test. Due to high dimensionality, we used the Bayesian approach with MCMC methods that were implemented on WinBUGS for parameter estimation. A series of simulations were conducted to evaluate parameter recovery of the HTM, the consequences of ignoring testlet effects on estimation of parameters and test reliability, and the effectiveness of model–data fit statistics. The results show that (a) the parameters of the HTM were recovered very well; (b) when testlet effect existed but was not considered by fitting the HIRT, the resulting item parameter estimates were biased, the factor loadings were underestimated, and the test reliability of the first-order latent traits was overestimated; and (c) the Bayesian DIC and PPMC were helpful in model comparison and model-data fit checking. The simulations were conducted on personal computers with 2.66-GHz Intel Core i5. It took approximately 2 to 6 days per replication. The computation time is feasible for most real data analyses but may not be feasible for comprehensive simulations, which was the main reason why only 10 replications were conducted in this study. To facilitate parameter estimation of the HTM, one may adopt standard IRT computer programs, such as BILOG-MG, MULTILOG, or PARSCALE, to fit each test at the first order, and then use the parameter estimates as starting values for the HTM using WinBUGS. Even better, a stand-alone computer program should be developed for the HTM. In the HTM, we formulated two orders of latent traits. Actually, the HTM can be easily extended to accommodate more orders but at the cost of the computational burden. In recent years, multilevel IRT models have been developed to describe multilevel structure in persons (e.g., persons are nested within schools; Fox, 2005). It is of great value to combine both the HTM and multilevel models. Future studies can be conducted to explore these theoretical issues and develop more efficient computer programs for these complicated models. Declaration of Conflicting Interests The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 510 Educational and Psychological Measurement 73(3) Funding The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The first author was supported by the National Science Council (Grant No. NSC 100-2410-H-133-015). References Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinees’ ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397-479). Reading, MA: Addison-Wesley. Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395-414. Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlet. Psychometrika, 64, 153-168. Brooks, S. P., & Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434-455. Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrica, 73, 209-230. de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333-353. de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size a higher order IRT model approach. Applied Psychological Measurement, 34, 267-285. de la Torre, J., & Song, H. (2009). Simultaneously estimation of overall and domain abilities: A higher order IRT model approach. Applied Psychological Measurement, 33, 620-639. Fox, J.-P. (2005). Multilevel IRT using dichotomous and polytomous items. British Journal of Mathematical and Statistical Psychology, 58, 145-172. Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68, 169-191. Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409. Gelman, A., Meng, X.-L., & Stern, H. S. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6, 733-807. Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741. Huang, H.-Y., Wang, W.-C., & Chen, P.-H. (2010). An item response model with hierarchical latent traits. Paper presented at the annual meeting of the American Educational Research Association, Denver, CO. Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331-358. Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74, 21-48. Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30, 3-21. Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015 Huang and Wang 511 Lin, S. S. J., & Liu, E. Z.-F. (2003). Study on Internet flaming and hostility: Cognitive process, situational factors, and the building of an automatic sorting system of hostile posters (I) (NSC Research Report No. NSC92-2520-S009-005). Taipei, Taiwan: National Science Council. Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174. Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen, Denmark: Institute of Educational Research. (Expanded edition, 1980; Chicago, IL: University of Chicago Press) Ryan, J. J., & Schnakenberg-Ott, S. D. (2003). Scoring reliability on the Wechsler Adult Intelligence Scale–Third Edition (WAIS-III). Assessment, 10, 151-159. Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68, 413-430. Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30, 298-321. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B: Methodological, 64, 583-616. Spiegelhalter, D. J., Thomas, A., & Best, N. (2003). WinBUGS version 1.4 [Computer program]. Cambridge, England: MRC Biostatistics Unit, Institute of Public Health. Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlet: A use of multiplecategorical-response models. Journal of Educational Measurement, 26, 247-260. van der Linden, W. J., Klein Entink, R. H., & Fox, J.-P. (2010). IRT parameter estimation with response times as collateral information. Applied Psychological Measurement, 34, 327-347. Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL useful in adaptive testing. In W. J. van der Linden & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). Dordrecht, the Netherlands: Kluwer Academic. Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York, NY: Cambridge University Press. Wainer, H., & Lukhele, R. (1997). How reliable are TOEFL scores? Educational and Psychological Measurement, 57, 741-758. Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15, 22-29. Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220. Wang, W.-C., & Wilson, M. R. (2005). The Rasch testlet model. Applied Psychological Measurement, 29, 126-149. Wang, X., Bradlow, E. T., & Wainer, H (2002). A general Bayesian model for testlets: Theory and applications. Applied Psychological Measurement, 26, 109-128. Downloaded from epm.sagepub.com at NATIONAL TAIWAN NORMAL UNIV LIB on May 13, 2015