Characterizing the Performance of the Conway-Maxwell
Transcription
Characterizing the Performance of the Conway-Maxwell
2 Characterizing the Performance of the Conway-Maxwell Poisson Generalized Linear Model 3 4 5 Royce A. Francis1, Srinivas Reddy Geedipally2, Seth D. Guikema1, Soma Sekhar Dhavala3, Dominique Lord2, Sarah LaRocca1 1 6 7 8 9 10 11 12 13 1 Department of Geography and Environmental Engineering, Johns Hopkins University, 313 Ames Hall, 3400 N. Charles St., Baltimore, MD 21218 {[email protected], [email protected], [email protected]} 2 Zachry Department of Civil Engineering, Texas A&M University, 3136 TAMU, College Station, TX 778433136 {[email protected], [email protected]} 3 Department of Statistics, Texas A&M University, 3143 TAMU, College Station, TX 77843-3143 {[email protected]} 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Authors’ Footnote: Royce Francis is Postdoctoral Fellow, Seth Guikema is Assistant Professor, and Sarah LaRocca is Graduate Research Assistant in the Department of Geography and Environmental Engineering, The Johns Hopkins University, Baltimore, MD 21218 (e-mail: [email protected], [email protected], [email protected]). Srinivas Geedipally is Engineering Research Associate, Dominique Lord is Assistant Professor, Texas Transportation Institute, College Station, TX 77843 (e-mail: [email protected], [email protected]). Soma Dhavala is Postdoctoral Fellow, Department of Statistics, Texas A&M University, College Station, TX 77843 (email: [email protected]). This work was partially supported by the National Science Foundation (grant ECCS-0725823), the U.S. Department of Energy (grant BER- FG02-08ER64644) and the Whiting School of Engineering. All opinions in this paper are those of the authors and do not necessarily reflect the views of the sponsors. Characterizing the Performance of the Conway-Maxwell Poisson GLM 27 ABSTRACT 28 This paper documents the performance of a MLE for the Conway-Maxwell-Poisson (COM- 29 Poisson) generalized linear model (GLM). This distribution was originally developed as an 30 extension of the Poisson distribution in 1962 and has a unique characteristic in that it can 31 handle both under-dispersed and over-dispersed count data. Parameter estimation for this 32 model is done using a maximum likelihood estimation (MLE) approach for the COM- 33 Poisson single-link function. The objectives of this paper are to (1) characterize the 34 parameter estimation accuracy of the MLE implementation of the COM-Poisson GLM and 35 (2) estimate the prediction accuracy of the COM-Poisson GLM. We use simulated datasets to 36 assess the performance of the COM-Poisson GLM. The results of the study indicate that the 37 COM-Poisson GLM is flexible enough to model under-, equi- and over-dispersed datasets 38 with different sample mean values. The results also show that the COM-Poisson GLM yields 39 accurate parameter estimates. Furthermore, we show that the Shmueli et al (2005) asymptotic 40 approximation of the mean of the COM-Poisson distribution is an accurate approximation 41 even when the sample mean of the data is substantially below the lower bound previously 42 suggested. However, the approximation is less accurate for very lower sample mean values, 43 and we characterize the degree of this inaccuracy. The COM-Poisson GLM provides a 44 promising and flexible approach for performing count data regression. 45 46 Keywords: Generalized linear model, count data, maximum likelihood estimation, regression 47 1 Characterizing the Performance of the Conway-Maxwell Poisson GLM 47 1. INTRODUCTION 48 The Poisson family of discrete distributions stands as a benchmark for analyzing count 49 data. However, because the Poisson distribution itself has well-known limitations due to the 50 assumed mean-variance structure, a number of generalizations have been proposed. The 51 Conway-Maxwell Poisson (COM-Poisson) distribution is one of these generalizations, 52 Originally developed in 1962 as a method for modeling both under-dispersed and over- 53 dispersed count data (Conway and Maxwell 1962), the COM-Poisson distribution was then 54 revisited by Shmueli et al. (2005) after a period in which it was not widely used. Shmueli et 55 al. (2005) derived many of the basic properties of the distribution. The COM-Poisson 56 belongs to the exponential family as well as to the two-parameter power series family of 57 distributions. It introduces an extra parameter, 58 successive ratios of probabilities. It nests the usual Poisson ( 59 Bernoulli ( 60 the Poisson distribution (Boatwright, Borle and Kadane 2003, Shmueli, et al. 2005). The 61 conjugate priors for the parameters of the COM-Poisson distribution have also been derived 62 (Kadane, Shmueli, Minka, Borle and Boatwright 2006). , which governs the rate of decay of = 1), geometric ( = 0) and = ∞) distributions and it allows for both thicker and thinner tails than those of 63 The COM-Poisson distribution has recently become much more widely used, including 64 studies analyzing word length (Shmueli et al. 2005), birth process models (Ridout and 65 Besbeas 2004), prediction of purchase timing and quantity decisions (Boatwright et al. 2003), 66 quarterly sales of clothing (Shmueli et al. 2005), internet search engine visits (Telang, 67 Boatwright and Mukhopadhyay 2004), the timing of bid placement and extent of multiple 68 bidding (Borle, Boatwright and Kadane 2006), developing cure rate survival models 69 (Rodrigues et al. 2009), modeling electric power system reliability (Guikema and Coffelt 2 Characterizing the Performance of the Conway-Maxwell Poisson GLM 70 2008), modeling the number of car breakdowns (Mamode Khan and Jowaheer 2010) and 71 modeling motor vehicle crashes (***** 2008, 2010). 72 Only recently has the COM-Poisson distribution been used in a generalized linear model 73 setting, and the estimation efficiency and bias has not been adequately assessed. Guikema 74 and Coffelt (2006, 2008) introduced a generalized linear model (GLM) built on the COM- 75 Poisson, and ***** (2008, 2010) then utilized this model to analyze traffic accident data. The 76 approach of Guikema and Coffelt (2006, 2008) depended on MCMC for fitting a dual-link 77 GLM based on the COM-Poisson distribution. It also used a reformulation of the COM- 78 Poisson to provide a more direct centering parameter than the original COM-Poisson 79 formulation. Sellers and Shmueli (2008) developed an MLE for a single-link GLM based on 80 the original COM-Poisson distribution. Jowaheer and Mamode Khan (2009) compared the 81 efficiency of quasi-likelihood and MLE estimation approaches for estimating the parameters 82 of a single-link COM-Poisson GLM in the case of equidispersion based on simulated data 83 sets. 84 Aside from the limited tests of Jowaheer and Mamode Khan (2009), there has not been 85 any evaluation of the accuracy or bias of parameter estimates from the COM-Poisson 86 distribution, particularly for cases with underdisperson and overdispersion. Given that a 87 major advantage of the COM-Poisson GLM is its ability to handle both underdispersion and 88 overdispersion within a single conditional distribution, testing the estimation accuracy and 89 bias is a critical need. The objectives of this paper are to: (1) evaluate the estimation accuracy 90 of the Guikema and Coffelt COM-Poisson GLM for datasets characterized by overdispersion, 91 underdispersion and equidispersion with different means based on a maximum likelihood 92 estimator, and (2) characterize the accuracy of the asymptotic approximation of the mean of 3 Characterizing the Performance of the Conway-Maxwell Poisson GLM 93 the COM-Poisson distribution suggested by Shmueli et al. (2005) for the Guikema and 94 Coffelt COM-Poisson GLM. In addition, we briefly discuss potential convergence issues 95 with the infinite series in the normalizing factor of the distribution. We based our analysis on 96 900 simulated data sets representing 9 different mean-variance relationships spanning 97 underdispersion, overdispersion, and equidispersion for low, moderate, and high means. This 98 paper is organized as follows. The next section describes the COM-Poisson distribution and 99 its GLM framework. The third section presents our research method. The fourth section gives 100 the results of our computational study. The fifth section gives a brief discussion of the 101 results, and the sixth section provides concluding comments. 102 103 104 2. BACKGROUND This section describes the characteristics of the COM-Poisson distribution and the COMPoisson GLM framework. 105 Parameterization. The COM-Poisson distribution was first introduced by Conway and 106 Maxwell (1962) for modeling queues and service rates. Although the COM-Poisson 107 distribution is not particularly new, it has been largely unstudied and unused until Shmueli et 108 al. (2005) derived the basic properties of the distribution. 109 The COM-Poisson distribution is a two-parameter extension of the Poisson distribution 110 that generalizes some well-known distributions including the Poisson, Bernoulli, and 111 geometric distributions (Shmueli et al. 2005). It also offers a more flexible alternative to 112 distributions derived from these discrete distributions, such as the binomial and negative 113 binomial distributions. The COM-Poisson distribution can handle both under-dispersion 4 Characterizing the Performance of the Conway-Maxwell Poisson GLM 114 (variance less than the mean) and over-dispersion (variance greater than the mean). The 115 probability mass function (PMF) of the COM-Poisson for the discrete count Y is given as: 116 (1) 117 (2) 118 Here 119 is the shape parameter of the COM-Poisson distribution. Z is the normalizing constant. The 120 condition ν >1 corresponds to under-dispersed data, ν <1 to over-dispersed data, and ν =1 to 121 equi-dispersed (Poisson) data. Several common PMFs are special cases of the COM-Poisson 122 with the original formulation. Specifically, setting ν = 0 yields the geometric distribution, λ < 123 1 and ν→∞ yields the Bernoulli distribution in the limit, and ν = 1 yields the Poisson 124 distribution. This flexibility greatly expands the types of problems for which the COM- 125 Poisson distribution can be used to model count data. 126 127 is a centering parameter that is directly related to the mean of the observations and The asymptotic expressions for the mean and variance of the COM-Poisson derived by Shmueli et al. (2005) are given by Equations (3) and (4) below. 128 (3) 129 (4) 5 Characterizing the Performance of the Conway-Maxwell Poisson GLM 130 The COM-Poisson distribution does not have closed-form expressions for its moments in 131 terms of the parameters λ and ν. However, the mean can be approximated through a few 132 different approaches, including (i) using the mode, (ii) including only the first few terms of Z 133 when ν is large, (iii) bounding E[Y] when ν is small, and (iv) using an asymptotic expression 134 for Z in Equation (1). Shmueli et al. (2005) used the last approach to derive the 135 approximation for Z and the mean as: 136 (5) 137 Using the same approximation for Z as in Shmueli et al. (2005), the variance can be 138 approximated as: 139 140 141 (6) Shmueli et al (2005) suggest that these approximations may not be accurate for ν>1 or . 142 Despite its flexibility and attractiveness, the COM-Poisson has limitations in its 143 usefulness as a basis for a GLM, as documented in Guikema and Coffelt (2008). In 144 particular, neither λ nor ν provide a clear centering parameter. While λ is approximately the 145 mean when ν is close to one, it differs substantially from the mean for small ν. Given that ν 146 would be expected to be small for over-dispersed data, this would make a COM-Poisson 147 model based on the original COM-Poisson formulation difficult to interpret and use for over- 148 dispersed data. 6 Characterizing the Performance of the Conway-Maxwell Poisson GLM 149 Guikema and Coffelt (2008) proposed a re-parameterization using a new parameter 150 to provide a clear centering parameter. This new formulation of the COM-Poisson is 151 summarized in Equations (7) and (8) below: 152 (7) 153 (8) 154 By substituting 155 variance of Y are given in terms of the new formulation as 156 = in equations (4), (5), and (41) of Shmueli (2005), the mean and with asymptotic approximations and and 157 especially accurate once µ>10. With this new parameterization, the integral 158 part of µ is the mode leaving µ as a reasonable centering parameter. The substitution 159 also allows ν to keep its role as a shape parameter. That is, if ν < 1, the variance is 160 greater than the mean while ν > 1 leads to under-dispersion. In this paper we investigate the 161 accuracy of the approximation more closely. 162 This new formulation provides a good basis for developing a COM-Poisson GLM. The 163 clear centering parameter provides a basis on which the centering link function can be built, 164 allowing ease of interpretation across a wide range of values of the shape parameter. 165 Furthermore, the shape parameter ν provides a basis for using a second link function to allow 7 Characterizing the Performance of the Conway-Maxwell Poisson GLM 166 the amount of over-dispersion, equi-dispersion or under-dispersion to vary across 167 measurements. 168 Generalized Linear Model. Guikema and Coffelt (2008) developed a COM-Poisson 169 GLM framework for modeling discrete count data using the reformulation of the COM- 170 Poisson given in equations (7) and (8). This dual-link GLM framework, in which both the 171 mean and the variance depend on covariates, is given in equations (9-12), where Y is the 172 count random variable being modeled and xi and zi are covariates. There are p covariates used 173 in the centering link function and q covariates used in the shape link function. The sets of 174 parameters used in the two link functions do not need to be identical. If a single-link model is 175 desired, the second link given by equation (12) can be removed allowing a single ν to be 176 estimated directly. 177 (9) 178 (10) 179 (11) 180 (12) 8 Characterizing the Performance of the Conway-Maxwell Poisson GLM 181 In our simulation exercise, we employ equation 11 and 12 as a single link function. The 182 COM-Poisson log-likelihood for a single observation with single-link function may now be 183 written as: 184 185 (13) Let . The log-likelihood for the entire dataset (N observations) is given as: 186 (14) 187 To obtain the maximum likelihood estimates (MLE) for the parameters, we suggest using 188 unconstrained optimization to avoid convergence issues in the iterative solution method. To 189 use unconstrained optimization, let . The likelihood now takes the form: 190 191 192 (15) To find the MLE for the parameters, we first find the partial derivatives of the loglikelihood, l, with respect to the coefficients and the dispersion parameter and are given as: 193 194 195 (16) and, . (17) 9 Characterizing the Performance of the Conway-Maxwell Poisson GLM 196 These equations are solved using iteratively re-weighted least squares approaches described 197 in Nelder and Wedderburn (1972), and Wood (2006). 198 To complete the MLE, we must obtain the information matrix at the MLE of the log- 199 likelihood to estimate the standard errors of the coefficient estimates. Sellers and Shmueli 200 (2008) present useful results implied by the likelihood equations for finding the information 201 matrix at the MLE which will later be useful. By setting equations (16) an (17) equal to zero, 202 we see the following: 203 204 (18) and, 205 (19) 206 These results not only permit solution of the MLE using iteratively re-weighted least squares, 207 but also allow the information matrix (Hessian of the log-likelihood) to be solved as a 208 function of the “true” parameters. The information matrix at the MLE of the log-likelihood, 209 I, is the expected second derivative matrix of the log-likelihood: 10 Characterizing the Performance of the Conway-Maxwell Poisson GLM 210 (20) 211 The parts of the information matrix follow (equations 21-23): 212 Iβ : (21) Iβ,ν: (22) 213 214 215 11 Characterizing the Performance of the Conway-Maxwell Poisson GLM 216 217 Iν : (23) 218 219 The information matrix at the MLE may now be used to obtain the standard errors of the 220 regression coefficients. Based on large sample approximation, the general large sample limit 221 for the parameters 222 is (Wood 2006): . (24) 12 Characterizing the Performance of the Conway-Maxwell Poisson GLM 223 Due to the GLM formulation, and our use of the Poisson distribution, the standard errors 224 follow directly from this result (e.g., 225 for each parameter: 226 ), and we can find the confidence intervals . (25) 227 The GLM described above is highly flexible and readily interpreted. It can model under- 228 dispersed datasets, over-dispersed datasets, and datasets that contain intermingled under- 229 dispersed and over-dispersed counts (for dual-link COM-Poisson GLM models only). While 230 we have not presented a dual-link COM-Poisson GLM, in the dual-link case the variance 231 may be allowed to depend on the covariate values, which can be important if high (or low) 232 values of some covariates tend to be variance-decreasing while high (or low) values of other 233 covariates tend to be variance-increasing. The parameters have a direct link to either the 234 mean or the variance, providing insight into the behavior and driving factors in the problem, 235 and the mean and variance of the predicted counts are readily approximated based on the 236 covariate values and regression parameter estimates. 237 3. METHODOLOGY 238 To test the estimation accuracy and computational burden of the MLE implementation of 239 the COM-Poisson GLM of Guikema and Coffelt (2008), we simulated a number of datasets 240 from the COM-Poisson GLM with known regression parameters that correspond to a range 241 of mean and variance values. We then estimated the regression parameters of the COM- 242 Poisson GLM using the MLE implementation and the estimated parameters were then 243 compared to the known parameter values. In this section, we present the procedural details. 13 Characterizing the Performance of the Conway-Maxwell Poisson GLM 244 Data Simulation. In order to characterize the accuracy of the parameter estimates from 245 the COM-Poisson GLM, one hundred datasets, with 1,000 observations each, were randomly 246 generated for each of nine different scenarios corresponding to different levels of dispersion 247 and mean. The nine scenarios include simulated datasets of under-dispersed, equi-dispersed 248 and over-dispersed data. For each level of dispersion, three different sample means were 249 used: high mean (~ 20.0), moderate mean (~ 5.0) and low mean (~ 0.8). Each of these 900 250 datasets was then used as input for the COM-Poisson GLM, and the resulting parameters 251 estimates were compared to the known parameters values that had been used to generate the 252 datasets. 253 Initially, we simulated 1,000 values of the covariates X1 and X2 from a uniform 254 distribution on [0, 1]. The centering parameter µ and shape parameter ν were then generated 255 according to Equations (11) and (12) with known (assigned) regression parameters. Note that 256 the same covariates are used for both the centering and shape parameters. Realizations from 257 the COM-Poisson distribution are then generated using the inverse CDF method (Devroye 258 1986). 259 The regression parameter values were selected in such a way that the shape parameter ν 260 was always set between 0 and 1 for simulating the over-dispersed datasets, above 1 for the 261 under-dispersed datasets and approximately 1 for the equi-dispersed datasets. The parameters 262 that were assigned in simulating the datasets are given in the table below. Table 1 263 summarizes the characteristics of the simulation scenarios. 264 265 Testing Protocol. The coefficients of the COM-POISSON GLM were estimated using the MLE described above, implemented in R (R Core Development Team 2008). 14 Characterizing the Performance of the Conway-Maxwell Poisson GLM 266 4. RESULTS 267 This section consists of an assessment of the performance of the Guikema and Coffelt 268 COM-Poisson GLM for the nine scenarios mentioned above. The results concerning the 269 accuracy of the asymptotic approximation of the mean of the COM-Poisson suggested by 270 Shmueli et al. (2005) are also discussed. 271 Parameter Estimation Accuracy. Table 2 reports the fraction of simulated datasets where 272 the the 95% confidence interval does not contain the “true” parameter value for the parameter 273 estimates. Our results show that, generally, the true coefficient values are contained within 274 the 95% confidence interval with 95% frequency. Some instances where this is not the case 275 are the dispersion parameter under S1 and S7, and the intercept under S3. Estimating the 276 dispersion parameter under the MLE approach presented may be difficult. In only two cases 277 (S2, S6) does the 95% confidence interval contain the true value of the dispersion parameter 278 with greater than 95% frequency. While under most scenarios, the 95% confidence interval 279 contains the true parameter value with greater than 90% frequency, the two highest fractions 280 of simulated datasets where the confidence interval does not contain the “true” parameter 281 value correspond to estimation of the dispersion parameter. The true value of the dispersion 282 parameter may be difficult to ascertain with the level of confidence assumed by the α-level 283 selected. 284 Histograms of the MLE linear and dispersion parameters are plotted and compared with 285 the true parameters in Figures 1-3. Each figure corresponds to a specific dispersion level and 286 each subplot corresponds to a different dataset scenario. Figure 1 illustrates histograms for 287 the over-dispersed scenarios (S1-S3). While the parameter estimates generally seem to be 288 appropriate for all of the linear parameters, β, the dispersion parameter, ν, for the high-mean 15 Characterizing the Performance of the Conway-Maxwell Poisson GLM 289 overdispersed scenario (S1) is overestimated. The dispersion parameter for the moderate and 290 low mean (S2 and S3) scenarios appear to be more well-behaved, and the ranges of the 291 parameter distributions in the low-mean case are much wider (1.0-3.0) than the moderate and 292 high mean cases (0.1-0.5) for each parameter estimated. While this pattern is not replicated 293 for the under and equidispersed scenarios, the dispersion parameter distribution for all 294 scenarios except S3 is wider than those of its attendant linear parameter distributions. For 295 example, Figure 2 illustrates the plots for the under-dispersed scenarios (S4-S6). As with the 296 overdispersed cases, all linear parameter estimates are well behaved. In the underdispersed 297 scenarios, the dispersion parameters are also well-behaved, although the dispersion parameter 298 distribution for the high-mean case (S4) seems slightly skewed to the right. Furthermore, the 299 low-mean scenario (S6) has a wider parameter estimate distribution for each parameter when 300 compared to its high and moderate-mean counterparts. Figure 3 replicates this pattern for the 301 equidispersed scenarios (S7-S9); though the magnitudes of the parameter ranges are 302 generally smaller than those for the overdispersed scenarios (S1-S3), but slightly larger than 303 those for the underdispersed scenarios (S4-S6). 304 Prediction Bias. The bias of an estimator is defined as the difference between an 305 estimator's expected value and the true value of the parameter being estimated. If the bias is 306 zero then the estimator is said to be unbiased. The bias of an estimator 307 E(θ ) − θ where 308 observed data. ∧ is the true value of the parameter and the estimator 16 is calculated as is a function of the Characterizing the Performance of the Conway-Maxwell Poisson GLM 309 The bias of the parameters β and ν under each scenario is calculated as the difference 310 between their average estimates from the 100 simulated datasets and the true (or assigned) 311 value in each scenario. 312 The bias of the centering parameter ‘µ’ is calculated as ∧ 313 E(µ ) − µ = (βˆ 0 − β 0 ) + (βˆ 1 − β1 )X 1 + (βˆ 2 − β 2 )X 2 314 The bias of the dispersion parameter ‘ν’ is calculated as 315 (26) (27) 316 Where and are the bias in the parameters and is the average value of 317 the independent variable. Figure 4 reports the parameter bias for each parameter under each 318 scenario. The overdispersed scenarios are shown in the left panel, the underdispersed in the 319 middle panel, and the equidispersed are shown in the right panel. For the overdispersed 320 scenarios, the bias did not change much for the parameters, except for the intercept parameter 321 under S3. For the underdispersed and equidispersed scenarios, the first two linear parameters 322 and the dispersion parameter demonstrated little sensitivity to the dispersion and mean levels, 323 while the intercept showed a marked sensitivity to these simulated conditions. As the mean 324 decreased in these scenarios, the bias decreased for the intercept. In addition, figure 4 325 suggests that the bias is largest for the overdispersed scenarios relative to the underdispersed 326 and equidispersed scenarios. 327 Figures 5 and 6 report the bias in the mean estimates, given the parameter estimates. 328 First, consider Figure 5. This plot illustrates, for each scenario, the distribution of the bias in 329 the mean for each simulated observation in each dataset under each scenario (N=100,000). 17 Characterizing the Performance of the Conway-Maxwell Poisson GLM 330 These plots suggest that the COM-Poisson GLM is biased in inferences drawn under a low- 331 mean scenario where underdispersion is expected (S3). Furthermore, although the range of 332 the bias distribution is largest for the large-mean equidispersed scenario (S7), the mode of the 333 remaining bias distributions seem close to zero. This observation is corroborated by Figure 334 6. Figure 6 illustrates the prediction accuracy under each scenario. The prediction accuracy 335 scatterplots reflect the information encoded in Figure 5; with the exception of S2 and S3, the 336 COM-Poisson GLM MLE estimator appears to be approximately unbiased. While the COM- 337 Poisson GLM performs better for high and moderate mean for all three categories of 338 dispersion, the moderate and low-mean scenarios under under- and equi-dispersion seem 339 more comparable in their levels of accuracy than the same comparison in the over-dispersed 340 scenarios. For over-dispersed and equi-dispersed datasets, the performance is worse for all 341 low sample mean values. The COM-Poisson GLM works well for all sample mean values for 342 the under-dispersed datasets. 343 Accuracy of the Asymptotic Mean Approximation. The centering parameter µ is 344 believed to adequately approximate the mean when µ >10 based on the asymptotic 345 approximation developed by Shmueli et al. (2005). However, the deviation of µ for mean 346 values below 10 (µ < 10) has not been investigated. We chose one sample from each of the 347 nine scenarios as a basis for estimating the accuracy of the asymptotic mean approximation. 348 First, the µ and ν parameters were calculated from the estimated parameters. We examined 349 the goodness of this approximation by simulating 100,000 random values from the COM- 350 Poisson distribution for a given µ and ν. We then plotted the mean of the simulated values 351 against the asymptotic mean approximation ( 18 ). The results showed that Characterizing the Performance of the Conway-Maxwell Poisson GLM 352 the asymptotic mean approximates the true mean accurately even for 10 > E[Y] > 5. As the 353 sample mean value decreases below 5, the accuracy of the approximation drops. As seen in 354 Figure 7, the asymptotic approximation holds well for all datasets with high and moderate 355 mean values irrespective of the dispersion in the data. The approximation is also accurate for 356 low sample mean values for under-dispersed datasets. The accuracy drops significantly for 357 the low sample mean values for over-dispersed and equi-dispersed datasets. There is not 358 much difference between the asymptotic mean approximation and the true mean for E[Y] 359 > 10. This difference begins to increase slightly at the moderate mean values, although the 360 difference is not high. The difference can clearly be observed for the low sample mean 361 values, particularly for the over-dispersed and equi-dispersed datasets. This shows that one 362 must be careful in using the asymptotic approximation for the mean of the COM-Poisson 363 GLM to estimate future event counts for datasets characterized by low sample mean values. 364 Convergence of S(µ,ν). When using the MLE approach to estimate parameters for the 365 COM-Poisson GLM, it is necessary to compute S(µ,ν), an infinite sum involving the 366 parameters µ and ν, as given in Equation 8. However, for some values of µ and ν, difficulties 367 arise in achieving convergence for the sum. Table 3 summarizes the combinations of µ and ν 368 requiring greater than 170 terms to achieve convergence (defined as ε = 0.0001) in S(µ,ν). 369 These combinations are significant because calculating S(µ,ν) requires computing n!, where n 370 is the index of the term in the infinite sum. Therefore, with combinations of µ and ν 371 requiring summation of greater than 170 terms for convergence, it is necessary to calculate 372 the factorial of numbers larger than 170. However, the built-in factorial function in R, as in 373 most software, cannot be used for numbers greater than 170, because 170! is the largest 374 number that can be represented as an IEEE double precision value (Press et al 2007). Thus, 19 Characterizing the Performance of the Conway-Maxwell Poisson GLM 375 for certain combinations of µ and ν, it is necessary to use an alternate (and likely slower or 376 less precise) algorithm for computing the factorial. However, as shown in Table 3, such 377 difficulties will only arise when using data sets with a very high mean, and are therefore 378 unlikely to be problematic in most applications. In this paper, all combinations of values of µ 379 and ν required 170 or fewer terms to reach convergence. 380 5. DISCUSSION 381 This paper shows that the COM-Poisson GLM is flexible in handling count data 382 irrespective of the dispersion in the data. First, the true parameters lie in the 95% confidence 383 interval for nearly all cases, except as noted above, and are generally close to the estimated 384 value of the parameters. The confidence intervals were found to be wider for the low mean 385 values for both the centering and shape parameters. The bias in the prediction of the 386 parameters and the mean also does not appear to be sensitive to assumptions we have made 387 concerning sample mean and dispersion level, except for the intercept, whose bias decreased 388 as the sample mean decreased under the under- and equi-dispersed scenarios. Even at the low 389 sample mean values, the bias is considerably less for under-dispersed datasets than for over- 390 dispersed and equi-dispersed datasets. Despite its flexibility in handing count data with all 391 dispersions, the COM-Poisson GLM suffers from important limitations for moderate- and 392 low-mean over-dispersed data. The Negative Binomial (Poisson-gamma) models exhibit 393 similar behavior (Lord 2006). Second, the asymptotic approximation of the mean suggested 394 by Shmueli et al (2005) approximates the true mean adequately for E[Y] > 5. This value, 395 obtained by numerical analysis of the COM-Poisson GLM, is substantially lower than the 396 lower bound value of 10 suggested by Shmueli et al. (2005). As the sample mean value 397 decreases, the accuracy of the approximation becomes lower. The asymptotic approximation 20 Characterizing the Performance of the Conway-Maxwell Poisson GLM 398 is accurate for all datasets with high and moderate sample mean values irrespective of the 399 dispersion in the data. The approximation is also accurate for low sample mean values for 400 under-dispersed datasets. However, the accuracy drops substantially for low sample mean 401 values for over-dispersed and equi-dispersed datasets. 402 In summary, this paper has documented the performance of COM-Poisson GLM for 403 datasets characterized by different levels of dispersion and sample mean values. The results 404 of this study showed that the COM-Poisson GLMs can handle under-, equi- and over- 405 dispersed datasets with different mean values, although the confidence intervals are found to 406 be wider for low sample mean values. Despite its limitations for low sample mean values for 407 over-dispersed datasets, the COM-Poisson GLM is still a flexible method for analyzing count 408 data. The asymptotic approximation of the mean is accurate for all datasets with high and 409 moderate sample mean values irrespective of the dispersion in the data, and it is also accurate 410 for low sample mean values for under-dispersed datasets. The COM-Poisson GLM is a 411 promising, flexible regression model for count data. 412 21 Characterizing the Performance of the Conway-Maxwell Poisson GLM 412 REFERENCES 413 ***** (2008), "Extension of the Application of Conway-Maxwell-Poisson Models: 414 Analyzing Traffic Crashes Exhibiting Underdispersion," Working Papers. 415 416 ***** (2009), "Extension of the Application of Conway-Maxwell-Poisson Models: 417 Analyzing Traffic Crash Data Exhibiting Under-Dispersion," Risk Analysis, in press. 418 419 Boatwright, P., Borle, S., and Kadane, J. B. (2003), "A Model of the Joint Distribution of 420 Purchase Quantity and Timing," Journal of the American Statistical Association, 98, 564- 421 572. 422 423 Borle, S., Boatwright, P., and Kadane, J. B. (2006), "The Timing of Bid Placement and 424 Extent of Multiple Bidding: An Empirical Investigation Using Ebay Online Auctions," 425 Statistical Science, 21, 194-205. 426 427 Conway, R. W., and Maxwell, W. L. (1962), "A Queuing Model with State Dependent 428 Service Rates," Journal of Industrial Engineering, 12, 132-136. 429 430 Guikema, S. D., and Coffelt, J. P. (2008), "A Flexible Count Data Regression Model for Risk 431 Analysis.," Risk Analysis, 28, 213-223. 432 22 Characterizing the Performance of the Conway-Maxwell Poisson GLM 433 Jowaheer, V. and N.A. Mamode Khan. 2009. "Estimating Regerssion Effects in COM 434 Poisson Generalized Linear Model," World Academy of Science, Engineering and 435 Technology. 53, 213-223. 436 437 Kadane, J. B., Shmueli, G., Minka, T. P., Borle, S., and Boatwright, P. (2006), "Conjugate 438 Analysis of the Conway-Maxwell-Poisson Distribution," Bayesian Analysis, 1, 363-374. 439 440 Lord, D. (2006), "Modeling Motor Vehicle Crashes Using Poisson-Gamma Models: 441 Examining the Effects of Low Sample Mean Values and Small Sample Size on the 442 Estimation of the Fixed Dispersion Parameter," Accident Analysis & Prevention, 38, 751- 443 766. 444 445 Lord, D., Guikema, S. D., and Geedipally, S. (2008), "Application of the Conway-Maxwell- 446 Poisson Generalized Linear Model for Analyzing Motor Vehicle Crashes," Accident Analysis 447 & Prevention, 40, 1123-1134. 448 449 Mamode Khan, N., and Jowaheer, V., (2010), "A comparison of marginal and joint 450 generalized quasi-likelihood estimating equations based on the Com-Poisson GLM: 451 Application to car breakdowns data," International Journal of Mathematical and Statistical 452 Sciences 2:2. 453 454 Nelder, J. A., and Wedderburn, R. W. M. (1972), "Generalized Linear Models," Journal of 455 the Royal Statistical Society, Series A, 135, 370-384. 23 Characterizing the Performance of the Conway-Maxwell Poisson GLM 456 457 Press, W. H., Teukolsky, S. A., Vetterling, W. T., Flannery, B. P. (2007). Numerical Recipes: 458 The Art of Scientific Computing, New York: Cambridge University Press. 459 460 Ridout, M. S., and Besbeas, P. (2004), "An Empirical Model for Underdispersed Count 461 Data," Statistical Modelling, 4, 77-89. 462 463 Sellers, K. F., and Shmueli, G. (2010), "A Flexible Regression Model for Count Data," 464 Annals of Applied Statistics, In Press. 465 466 Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005), "A Useful 467 Distribution for Fitting Discrete Data: Revival of the Conway-Maxwell-Poisson 468 Distribution.," Applied Statistics, 54, 127-142. 469 470 R Core Development Team (2008), "R: A Language and Environment for Statistical 471 Computing." 472 473 Rodrigues, J., de Castro, M., Cancho, V. G., Balakrishnan, N. (2009), "COM–Poisson cure 474 rate survival models and an application to a cutaneous melanoma data," Journal of Statistical 475 Planning and Inference, 139, 3605-3611. 476 477 Telang, R., Boatwright, P., and Mukhopadhyay, T. (2004), "A Mixture Model for Internet 478 Search-Engine Visits," Journal of Marketing, 41, 206-214. 24 Characterizing the Performance of the Conway-Maxwell Poisson GLM 479 480 Wood, S. N. (2006), Generalized Additive Models: An Introduction with R, eds. B. P. Carlin, 481 C. Chatfield, M. A. Tanner and J. Zidek, New York: Chapman and Hall/CRC. 482 483 25 Characterizing the Performance of the Conway-Maxwell Poisson GLM 483 TABLES AND FIGURES 484 485 Table 1: Assigned parameters for the case study. The scenarios are labeled S1 through S9. Over-dispersed data β0 β1 β2 ν0 Under-dispersed data Equi-dispersed data High mean S1 Moderate Mean S2 Low mean S3 High mean S4 Moderate Mean S5 Low mean S6 High mean S7 Moderate Mean S8 Low mean S9 3.0 1.3 -2.0 3.0 1.7 0.2 3.0 1.7 0.2 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.15 -0.4 -1.3 -1.3 1.0 1.0 1.2 0 0 0 486 487 488 Table 2: Fraction of cases where "true" parameter values do not fall in confidence interval. Dispersion Overdispersed Underdispersed Equidispersed 489 490 491 Scenario Hi Mean (S1) Mod Mean (S2) Lo Mean (S3) Hi Mean (S4) Mod Mean (S5) Lo Mean (S6) Hi Mean (S7) Mod Mean (S8) Lo Mean (S9) β0 0.08 0.03 0.22 0.05 0.04 0.04 0.06 0.05 0.04 β1 0.03 0.04 0.01 0.04 0.01 0.04 0.06 0.03 0.04 β2 0.02 0.03 0.04 0.04 0.05 0.05 0.01 0.04 0.03 ν 0.45 0.03 0.08 0.07 0.09 0.05 0.35 0.1 0.06 Table 3: Combinations of λ and ν values requiring greater than 170 terms for convergence of Z(λ,ν) (Equation 2). µ ν ≥ 190 0.05 ≥ 110 0.10 ≥ 102 0.15 ≥ 98 0.20 ≥ 105 0.25 ≥ 110 0.30 ≥ 111 0.35 ≥ 112 0.40 ≥ 113 0.45 ≥ 117 0.50 ≥ 118 0.55 ≥ 120 0.60 ≥ 121 0.65 492 493 494 26 Characterizing the Performance of the Conway-Maxwell Poisson GLM 494 495 496 497 Figure 1: Histograms for parameter estimates in overdispersed scenarios. S1 (highmean) first row, S2 (moderate mean) second row, S3 (low mean) third row. "True" parameter values indicated by vertical red lines. 498 499 27 Characterizing the Performance of the Conway-Maxwell Poisson GLM 499 500 501 502 Figure 2: Histograms for parameter estimates in underdispersed scenarios. S4 (highmean) first row, S5 (moderate mean) second row, S6 (low mean) third row. "True" parameter values indicated by vertical red lines. 503 504 28 Characterizing the Performance of the Conway-Maxwell Poisson GLM 504 505 506 507 Figure 3: Histograms for parameter estimates in equidispersed scenarios. S7 (highmean) first row, S8 (moderate mean) second row, S9 (low mean) third row. "True" parameter values indicated by vertical red lines. 508 509 29 Characterizing the Performance of the Conway-Maxwell Poisson GLM 509 510 511 Figure 4: Prediction bias for parameter estimates under each scenario. β 0 blue diamonds, β 1 red squares, β 2 yellow triangles, ν indicated by “x” on green line. 512 30 Characterizing the Performance of the Conway-Maxwell Poisson GLM 512 513 514 Figure 5: Prediction accuracy histograms for overdispersed (top row, S1-S3), underdispersed (middle row, S4-S6), and equidispersed (bottom row, S7-S9) datasets. 515 31 Characterizing the Performance of the Conway-Maxwell Poisson GLM 515 516 517 Figure 6: Prediction accuracy scatterplots for overdispersed (top row, S1-S3), underdispersed (middle row, S4-S6), and equidispersed (bottom row, S7-S9) datasets. 32 Characterizing the Performance of the Conway-Maxwell Poisson GLM 518 Figure 7: Mean versus asymptotic approximation for each scenario. 519 33