PROCEEDINGS OF THE SAWTOOTH SOFTWARE CONFERENCE
October 2013

Copyright 2014
All rights reserved. No part of this volume may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from Sawtooth Software, Inc.

FOREWORD

These proceedings are a written report of the seventeenth Sawtooth Software Conference, held in Dana Point, California, October 16–18, 2013. Two hundred ten attendees participated. This conference included a separate Healthcare Applications in Conjoint Analysis track; however, these proceedings contain only the papers delivered at the Sawtooth Software Conference.

The focus of the Sawtooth Software Conference continues to be quantitative methods in marketing research. The authors were charged with delivering presentations of value to both the most sophisticated and least sophisticated attendees. Topics included choice/conjoint analysis, surveying on mobile platforms, Menu-Based Choice, MaxDiff, hierarchical Bayesian estimation, latent class procedures, optimization routines, cluster ensemble analysis, and random forests.

The papers and discussant comments are in the words of the authors and very little copy editing was performed. At the end of each of the papers, we're pleased to display photographs of the authors and co-authors who attended the conference. We appreciate their cooperation in sitting for these portraits! It lends a personal touch and makes it easier for readers to recognize them at the next conference.

We are grateful to these authors for continuing to make this conference a valuable event and advancing our collective knowledge in this exciting field.

Sawtooth Software
June, 2014

CONTENTS

9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS ................................................. 1
Chris Chapman, Google

QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A TRADITIONAL MANUFACTURING FIRM: UPDATE AND CASE STUDY .................................................. 13
Robert J. Goodwin, Lifetime Products, Inc.

CAN CONJOINT BE FUN?: IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS .................................................. 39
Jane Tang & Andrew Grenville, Vision Critical

MAKING CONJOINT MOBILE: ADAPTING CONJOINT TO THE MOBILE PHENOMENON .................................................. 55
Chris Diener, Rajat Narang, Mohit Shant, Hem Chander & Mukul Goyal, AbsolutData

CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS .................................................. 69
Joseph White, Maritz Research

USING COMPLEX CHOICE MODELS TO DRIVE BUSINESS DECISIONS .................................................. 83
Karen Fuller, HomeAway, Inc. & Karen Buros, Radius Global Market Research

AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY .................................................. 97
Brent Fuller, Matt Madden & Michael Smith, The Modellers

MAXDIFF AUGMENTATION: EFFORT VS. IMPACT .................................................. 105
Urszula Jones, TNS & Jing Yeh, Millward Brown

WHEN U = βX IS NOT ENOUGH: MODELING DIMINISHING RETURNS AMONG CORRELATED CONJOINT ATTRIBUTES .................................................. 115
Kevin Lattery, Maritz Research

RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE? A VARIANCE DECOMPOSITION OF HB UTILITIES .................................................. 129
Keith Chrzan & Aaron Hill, Sawtooth Software
FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO CREATE VALUE .................................................. 135
Karlan Witt & Deb Ploskonka, Cambia Information Group

BRAND IMAGERY MEASUREMENT: ASSESSMENT OF CURRENT PRACTICE AND A NEW APPROACH .................................................. 147
Paul Richard McCullough, MACRO Consulting, Inc.

ACBC REVISITED .................................................. 165
Marco Hoogerbrugge, Jeroen Hardon & Christopher Fotenos, SKIM Group

RESEARCH SPACE AND REALISTIC PRICING IN SHELF LAYOUT CONJOINT (SLC) .................................................. 181
Peter Kurz, TNS Infratest, Stefan Binner, bms marketing research + strategy & Leonhard Kehl, Premium Choice Research & Consulting

ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS .................................................. 195
Dan Yardley, Maritz Research

ANCHORED ADAPTIVE MAXDIFF: APPLICATION IN CONTINUOUS CONCEPT TEST .................................................. 205
Rosanna Mau, Jane Tang, LeAnn Helmrich & Maggie Cournoyer, Vision Critical

HOW IMPORTANT ARE THE OBVIOUS COMPARISONS IN CBC? THE IMPACT OF REMOVING EASY CONJOINT TASKS .................................................. 221
Paul Johnson & Weston Hadlock, SSI

SEGMENTING CHOICE AND NON-CHOICE DATA SIMULTANEOUSLY .................................................. 231
Thomas C. Eagle, Eagle Analytics of California

EXTENDING CLUSTER ENSEMBLE ANALYSIS VIA SEMI-SUPERVISED LEARNING .................................................. 251
Ewa Nowakowska, GfK Custom Research North America & Joseph Retzer, CMI Research, Inc.

THE SHAPLEY VALUE IN MARKETING RESEARCH: 15 YEARS AND COUNTING .................................................. 267
Michael Conklin & Stan Lipovetsky, GfK

DEMONSTRATING THE NEED AND VALUE FOR A MULTI-OBJECTIVE PRODUCT SEARCH .................................................. 275
Scott Ferguson & Garrett Foster, North Carolina State University

A SIMULATION BASED EVALUATION OF THE PROPERTIES OF ANCHORED MAXDIFF: STRENGTHS, LIMITATIONS AND RECOMMENDATIONS FOR PRACTICE .................................................. 305
Jake Lee, Maritz Research & Jeffrey P. Dotson, Brigham Young University

BEST-WORST CBC CONJOINT APPLIED TO SCHOOL CHOICE: SEPARATING ASPIRATION FROM AVERSION .................................................. 317
Angelyn Fairchild, Research, RTI International, Namika Sagara & Joel Huber, Duke University

DOES THE ANALYSIS OF MAXDIFF DATA REQUIRE SEPARATE SCALING FACTORS? .................................................. 331
Jack Horne & Bob Rayner, Market Strategies International

USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF PRODUCT FEATURES .................................................. 341
Greg Allenby, Ohio State University, Jeff Brazell, The Modellers, John Howell, Penn State University & Peter Rossi, University of California Los Angeles

THE BALLAD OF BEST AND WORST .................................................. 357
Tatiana Dyachenko, Rebecca Walker Naylor & Greg Allenby, Ohio State University

SUMMARY OF FINDINGS

The seventeenth Sawtooth Software Conference was held in Dana Point, California, October 16–18, 2013. The summaries below capture some of the main points of the presentations and provide a quick overview of the articles available within the 2013 Sawtooth Software Conference Proceedings.
9 Things Clients Get Wrong about Conjoint Analysis (Chris Chapman, Google): Conjoint analysis has been used with great success in industry, Chris explained, but this often leads to some clients having misguided expectations regarding the technique. As a prime example, many clients are hoping that conjoint analysis will predict the volume of demand. While conjoint can provide important input to a forecasting model, it usually cannot alone predict volume without other inputs such as awareness, promotion, channel effects and competitive response. Chris cautioned against examining average part-worth utility scores only, without consideration for the distribution of preferences (heterogeneity), which often reveals profitable niche strategies. He also recommended fielding multiple studies with modest sample sizes that examine a business problem using different approaches rather than fielding one high-budget, large sample size survey. Finally, Chris stressed that leveraging insights from analytics (such as from conjoint) is better than relying solely on managerial instincts. It will generally increase the upside potential and reduce the downside risk for business decisions.

Quantitative Marketing Research Solutions in a Traditional Manufacturing Firm: Update and Case Study (Robert J. Goodwin, Lifetime Products, Inc.): Bob's presentation highlighted the history of Lifetime's use of conjoint methods to help it design and market its consumer-oriented product line. He also presented findings regarding a specific test involving Adaptive CBC (ACBC). Regarding practical lessons learned while executing numerous conjoint studies at Lifetime, Bob cited not overloading the number of attributes in the list just because the software can support them. Some attributes might be broken out and investigated using non-conjoint questions. Also, because many respondents do not care much about brand name in the retailing environment that Lifetime engages in, Bob has dropped brand from some of his conjoint studies. But, when he wants to measure brand equity, Bob uses a simulation method to estimate the value of a brand, in the context of competitive offerings and the "None" alternative. Finally, Bob conducted a split-sample test involving ACBC. He found that altering the questionnaire settings to focus the "near-neighbor design" either more tightly or less tightly around the respondent's BYO-specified concept didn't change results much. This, he argued, demonstrates the robustness of ACBC results to different questionnaire settings.

Can Conjoint Be Fun?: Improving Respondent Engagement in CBC Experiments (Jane Tang and Andrew Grenville, Vision Critical): Traditional CBC can be boring for respondents, as was noted in a recent Greenbook blog. Jane gave reasons why we should try to engage respondents in more interesting surveys, such as a) cleaner data often result, and b) happy respondents are happy panelists (and panelist cooperation is key). Ways to make surveys more fun include using adaptive tasks that seem to listen and respond to respondent preferences as well as feedback mechanisms that report something back to respondents based on their preferences. Jane and her co-author Andrew did a split-sample test to see if adding adaptive tasks and a feedback mechanism could improve CBC results. They employed tournament tasks, wherein concepts that win in earlier tasks are displayed again in later tasks. They also employed a simple level-counting mechanism to report back the respondent's preferred product concept.
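Neither the paper nor this summary shows what that mechanism looked like, but a level-counting rule is simple to express. The sketch below (in R, with a hypothetical data frame chosen_concepts; an illustration of the general idea, not the authors' implementation) tallies, for each attribute, which level appeared most often among the concepts a respondent chose and reports that combination back as the "preferred concept":

    # Illustrative sketch only (hypothetical objects), not the authors' code.
    # 'chosen_concepts': one row per concept this respondent chose, one column per attribute.
    build_preferred_concept <- function(chosen_concepts) {
      sapply(chosen_concepts, function(levels_chosen) {
        counts <- table(levels_chosen)       # how often each level was picked
        names(counts)[which.max(counts)]     # the most frequently picked level "wins"
      })
    }

    chosen_concepts <- data.frame(
      brand = c("A", "A", "B"),
      size  = c("Large", "Large", "Small"),
      price = c("$10", "$15", "$10")
    )
    build_preferred_concept(chosen_concepts)
    # returns: brand = "A", size = "Large", price = "$10"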
Although their study design didn't include good holdouts to examine predictive validity, there was at least modest evidence that the adaptive CBC design had lower error and performed better. Also, some qualitative evidence suggested that respondents preferred the adaptive survey. After accounting for scale differences (noise), they found very few differences in the utility parameters for respondents receiving "fun" versus standard CBC surveys. In sum, Jane suggested that if the utility results are essentially equivalent, why not let the respondent have more fun?

Making Conjoint Mobile: Adapting Conjoint to the Mobile Phenomenon (Chris Diener, Rajat Narang, Mohit Shant, Hem Chander, and Mukul Goyal, AbsolutData): Chris and his co-authors examined issues involving the use of mobile devices to complete complex conjoint studies. Each year more respondents are choosing to complete surveys using mobile devices, so this topic should interest conjoint analysis researchers. It has been argued that the small screen size of mobile devices may make it nearly impossible to conduct complex conjoint studies involving relatively large lists of attributes. The authors conducted a split-sample experiment involving the US and India using five different kinds of conjoint analysis variants (all sharing the same attribute list). The variants included standard CBC, partial-profile, and adaptive methods (ACBC). Chris found only small differences in the utilities or the predictive validity for PC-completed surveys versus mobile-completed surveys. Surprisingly, mobile respondents generally reported no readability issues (ability to read the questions and concepts on the screen) compared to PC respondents. The authors concluded that conjoint studies, even those involving nine attributes (as in their example), can be done effectively among those who elect to complete the surveys using their mobile devices (providing researchers keep the surveys short, use proper conjoint questionnaire settings, and emphasize good aesthetics).

Choice Experiments in Mobile Web Environments (Joseph White, Maritz Research): Joseph looked at the feasibility of conducting two separate eight-attribute conjoint analysis studies using PC, tablet, or mobile devices. He compared results based on the Swait-Louviere test, allowing him to examine results in terms of scale and parameter equivalence. He also examined internal and external fit criteria. His conjoint questionnaires included full-profile and partial-profile CBC. Across both conjoint studies, he concluded that respondents who choose to complete the studies via a mobile device show predictive validity at parity with, or better than, those who choose to complete the same study via PC or tablet. Furthermore, the response error for mobile-completed surveys is on par with PC. He summed it up by stating, "Consistency of results in both studies indicate that even more complicated discrete choice experiments can be readily completed in mobile computing environments."

Using Complex Choice Models to Drive Business Decisions (Karen Fuller, HomeAway, Inc. and Karen Buros, Radius Global Market Research): Karen Fuller and Karen Buros jointly presented a case study involving a complex menu-based choice (MBC) experiment for Fuller's company, HomeAway. HomeAway offers an online marketplace for vacation travelers to find rental properties. Vacation home owners and property managers list rental property on HomeAway's website.
The challenge for HomeAway was to design the pricing structure and listing options to better support the needs of owners and to create a better experience for travelers. Ideally, this would also increase revenues per listing. They developed an online questionnaire that looked exactly like HomeAway's website, including three screens to fully select all options involved in creating a listing. This process nearly replicated HomeAway's existing enrollment process (so much so that some respondents got confused regarding whether they had completed a survey or done the real thing). Nearly 2,500 US-based respondents completed multiple listings (MBC tasks), where the options and pricing varied from task to task. Later, a similar study was conducted in Europe. CBC software was used to generate the experimental design, the questionnaire was custom-built, and the data were analyzed using MBC software. The results led to specific recommendations for management, including the use of a tiered pricing structure, additional options, and an increase in the base annual subscription price. After implementing many of the suggestions of the model, HomeAway has experienced greater revenues per listing and the highest renewal rates among customers choosing the tiered pricing.

Augmenting Discrete Choice Data—A Q-sort Case Study (Brent Fuller, Matt Madden, and Michael Smith, The Modellers): Sometimes clients want to field CBC studies that have an attribute with an unusually large number of levels, such as messaging and promotion attributes. The problem with such attributes is obtaining enough precision to avoid illogical reversals while avoiding excessive respondent burden. Mike and Brent presented an approach to augmenting CBC data with Q-sort rankings for these attributes involving many levels. A Q-sort exercise asks respondents to sort items into a small number of buckets, where the number of items assigned per bucket is fixed by the researcher. The information from the Q-sort can be appended to the data as a series of inequalities (e.g., level 12 is preferred to level 18) constructed as new choice tasks. Mike and Brent found that the CBC data without Q-sort augmentation had some illogical preference orderings for the 19-level attribute. With augmentation, the reversals disappeared. One problem with augmenting the data is that it can artificially inflate the importance of the augmented attribute relative to the non-augmented attributes. Solutions to this problem include scaling back the importances to the original importances (at the individual level) given by HB estimation of the CBC data prior to augmentation.

MaxDiff Augmentation: Effort vs. Impact (Urszula Jones, TNS and Jing Yeh, Millward Brown): Ula (Urszula) and Jing described the common challenge that clients want to use MaxDiff to test a large number of items. With standard rules of thumb (to obtain stable individual-level estimates), the number of choice tasks becomes very large per respondent. Previous solutions presented at the Sawtooth Software Conference include augmenting the data using Q-sort (Augmented MaxDiff), Express MaxDiff (each respondent sees only a subset of the items), or Sparse MaxDiff (each respondent sees each item fewer than three times). Ula and Jing further investigated whether Augmented MaxDiff was worth the additional survey programming effort (as it is the most complicated) or whether the other approaches were sufficient.
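Neither summary shows the mechanics, but the augmentation step that both of the papers above rely on, converting a respondent's ranking information into synthetic paired-comparison "tasks" appended to the choice data, can be sketched briefly. The R snippet below is a hypothetical illustration under assumed data structures, not the authors' code:

    # Turn one respondent's ranking of levels or items (best first, e.g., from a
    # Q-sort or from rankings of the top/bottom items) into synthetic two-alternative
    # tasks in which the higher-ranked entry is coded as chosen.
    augment_tasks <- function(ranking) {
      pairs <- t(combn(ranking, 2))       # combn() keeps the rank order within each pair
      data.frame(winner = pairs[, 1],     # higher-ranked level, coded as chosen
                 loser  = pairs[, 2])     # lower-ranked level, coded as rejected
    }

    augment_tasks(c(12, 3, 18, 7))
    # yields rows such as winner = 12, loser = 18 ("level 12 is preferred to level 18"),
    # which are appended to the CBC or MaxDiff data before estimation.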
Although the authors didn’t implement holdout tasks that would have given a better read on predictive validity of the different approaches, they did draw some conclusions. They concluded a) at the individual level, Sparse MaxDiff is not very precise, but in aggregate the results are accurate, b) If you have limited time, augmenting data using rankings of the top items is probably better than augmenting the bottom items, and c) Augmenting on both top and bottom items is best if you need accurate individual-level results for TURF or clustering. When U = ßx Is Not Enough: Modeling Diminishing Returns among Correlated Conjoint Attributes (Kevin Lattery, Maritz Research): When conjoint analysis studies involve a large number of binary (on/off) features, standard conjoint models tend to over-predict interest in product concepts loaded up with nearly all the features and to under-predict product concepts including very few of the features. This occurs because typical conjoint analysis (using main effects estimation) assumes all the attributes are independent. But, Kevin explained, there often are diminishing returns when bundling multiple binary attributes (though the problem isn’t vii limited to just binary attributes). Kevin reviewed some design principles involving binary attributes (avoid the situation in which the same number of “on” levels occurs in each product concept). Next, Kevin discussed different ways to account for the diminishing returns among a series of binary items. Interaction effects can partially solve the problem (and are a stable and practical solution if the number of binary items is about 3 or fewer). Another approach is to introduce a continuous variable for the number of “on” levels within a concept. But, Kevin proposed a more complete solution that borrows from nested logit. He demonstrated greater predictive validity to holdouts for the nested logit approach than the approach of including a term representing the number of “on” levels in the concept. One current drawback, he noted, is that his solution may be difficult to implement with HB estimation. Respondent Heterogeneity, Version Effects or Scale? A Variance Decomposition of HB Utilities (Keith Chrzan and Aaron Hill, Sawtooth Software): When researchers use CBC or MaxDiff, they hope the utility scores are independent of which version (block) each respondent received. However, one of the authors, Keith Chrzan, saw a study a few years ago in which assignment in cluster membership was not independent of questionnaire version (>95% confidence). This lead to further investigation in which for more than half of the examined datasets, the authors found a statistically significant version effect upon final estimated utilities (under methods such as HB). Aaron (who presented the research at the conference) described a regression model that they built which explained final utilities as a function of a) version effect, b) scale effect (response error), and c) other. The variance captured in the “other” category is assumed to be the heterogeneous preferences of respondents. Across multiple datasets, the average version effect accounted for less than 2% of the variance in final utilities. Scale accounted for about 11%, with the remaining attributable to substantive differences in preferences across respondents or other unmeasured sources. Further investigation using synthetic respondents led the authors to conclude that the version effect was psychological rather than algorithmic. 
They concluded that although the version effect is statistically significant, it isn't strong enough to really worry about for practical applications.

Fusing Research Data with Social Media Monitoring to Create Value (Karlan Witt and Deb Ploskonka, Cambia Information Group): Karlan and Deb described the current business climate, in which social media provides an enormous volume of real-time feedback about brand health and customer engagement. They recommended that companies fuse social media measurement with other research as they design appropriate marketing mix strategies. The big question is how best to leverage the social media stream, especially how to move beyond the data gathering and summary stages to actually using the data to create value. Karlan and Deb's approach starts with research to identify the key issues of importance to different stakeholders within the organization. Next, they determine specific thresholds (from social media metrics) that would signal to each stakeholder (for whom the topic is important) that a significant event was occurring that required attention. Following the event, the organization can model the effect of the event on Key Performance Indicators (KPIs) by different customer groups. The authors presented a case study to illustrate the principles.

Brand Imagery Measurement: Assessment of Current Practice and a New Approach (Paul Richard McCullough, MACRO Consulting, Inc.): Dick (Richard) reviewed the weaknesses in current brand imagery measurement practices, specifically the weaknesses of the rating scale (lack of discrimination, scale use bias, halo). A new approach, brand-anchored MaxDiff, removes halo, avoids scale use bias, and is more discriminating. The process involves showing the respondent a brand directly above a MaxDiff question involving, say, 4 or 5 imagery items. Respondents indicate which of the items most describes the brand and which least describes the brand. Anchored scaling MaxDiff questions (to estimate a threshold anchor point) allow comparisons across brands and studies. But anchored scaling re-introduces some scale use bias. Dick tried different approaches to reduce the scale use bias associated with anchored MaxDiff using an empirical study. Part of the empirical study involved different measures of brand preference. He found that MaxDiff provided better discrimination, better predictive validity (of brand preference), and greater reduction of brand halo and scale use bias than traditional ratings-based measures of brand imagery. Ratings provided no better predictive validity of brand preference in his model than random data. However, the new approach also took more respondent time and had higher abandonment rates.

ACBC Revisited (Marco Hoogerbrugge, Jeroen Hardon, and Christopher Fotenos, SKIM Group): Christopher and his co-authors reviewed the stages in an Adaptive CBC (ACBC) interview and provided their insights into why ACBC has been a successful conjoint analysis approach. They emphasized that ACBC has advantages with more complex attribute lists and markets. The main thrust of their paper was to test different ACBC interviewing options, including a dynamic form of CBC programmed by the SKIM Group. They conducted a split-sample study involving choices for televisions. They compared default CBC and ACBC questionnaires to modifications of ACBC and CBC.
Specifically, they investigated whether dropping the "screener" section in ACBC would hurt results; whether to use a smaller random shock within summed pricing; whether to include price in ACBC's unacceptable questions; and the degree to which ACBC samples concepts directly around the BYO-selected concept. For the SKIM-developed dynamic CBC questionnaire, the first few choice tasks were exactly like a standard CBC task. The last few tasks displayed winning concepts chosen in the first few tasks. In terms of prediction of the holdout tasks, all ACBC variants did better than the CBC variants. None of the ACBC variations seemed to make much difference, suggesting that the ACBC procedure is quite robust even with simplifications such as removing the screening section.

Research Space and Realistic Pricing in Shelf Layout Conjoint (SLC) (Peter Kurz, TNS Infratest, Stefan Binner, bms marketing research + strategy, and Leonhard Kehl, Premium Choice Research & Consulting): In the early 1990s, the first CBC questionnaires only displayed a few product concepts on the screen, without the use of graphics. Later versions supported shelf-looking displays, complete with graphics and other interactive elements. Rather than using lots of attributes described in text, the graphics themselves portrayed different sizes, claims, and package design elements. However, even the most sophisticated computerized CBC surveys (including virtual reality) cannot reflect the real situation of a customer at the supermarket. The authors outlined many challenges involving shelf layout conjoint (SLC). Some of the strengths of SLC, they suggested, are in optimization of assortment (e.g., line extension problems, substitution) and price positioning/promotions. Certain research objectives are problematic for SLC, including volumetric predictions, positioning of products on the shelf, and new product development. The authors concluded by offering specific recommendations for improving results when applying SLC, including: use realistic pricing patterns and ranges within the tasks, use realistic tag displays, and reduce the number of parameters to estimate within HB models.

Attribute Non-Attendance in Discrete Choice Experiments (Dan Yardley, Maritz Research): When respondents ignore certain attributes when answering CBC tasks, this is called "attribute non-attendance." Dan described how some researchers in the past have asked respondents directly which attributes they ignored (stated non-attendance) and have used that information to try to improve the models. To test different approaches to dealing with non-attendance, Dan conducted two CBC studies. The first involved approximately 1300 respondents, using both full- and partial-profile CBC. The second involved about 2000 respondents. He examined both aggregate and disaggregate (HB) models in terms of model fit and out-of-sample holdout prediction. Dan also investigated ways to try to ascertain from HB utilities that respondents were ignoring certain attributes (rather than rely on stated non-attendance). For attributes deemed to have been ignored by a respondent, the codes in the independent variable matrix were held constant at zero. He found that modeling stated non-attendance had little impact on the results, but usually slightly reduced the fit to holdouts. He experimented with different cutoff rates under HB modeling to deduce whether individual respondents had ignored attributes.
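The summary does not reproduce Dan's code, and the exact rule he used is not stated here; as a loose illustration of the general idea (the range-based cutoff below is an assumption, as are the object names), inferring non-attendance from HB utilities and holding the corresponding design codes at zero might look like this in R:

    # Hypothetical sketch. 'beta' = one respondent's part-worths; 'attr_of_level'
    # = the attribute to which each part-worth belongs; 'X' = that respondent's
    # design matrix with columns in the same order as 'beta'.
    infer_ignored <- function(beta, attr_of_level, cutoff = 0.5) {
      utility_range <- tapply(beta, attr_of_level, function(b) diff(range(b)))
      names(utility_range)[utility_range < cutoff]   # attributes with "flat" utilities
    }

    zero_ignored <- function(X, attr_of_level, ignored) {
      X[, attr_of_level %in% ignored] <- 0           # hold ignored attributes' codes at zero
      X
    }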
For his two datasets, he was able to slightly improve prediction of holdouts using this approach.

Anchored Adaptive MaxDiff: Application in Continuous Concept Test (Rosanna Mau, Jane Tang, LeAnn Helmrich, and Maggie Cournoyer, Vision Critical): Rosanna and her co-authors investigated the feasibility of using an adaptive form of anchored MaxDiff within multi-wave concept tests as a replacement for traditional 5-point rating scales. Concept tests have traditionally been done with 5-point scales, with the accompanying lack of discrimination and scale use bias. Anchored MaxDiff has proven to have superior discrimination, but the stability of the anchor (the buy/no buy threshold) has been called into question in previous research presented at the Sawtooth Software Conference. Specifically, the context of how many concepts are being evaluated within the direct anchoring approach can affect the absolute position of the anchor. This would be extremely problematic for using the anchored MaxDiff approach to compare the absolute desirability of concepts across multiple waves of research that involve differing numbers and quality of product concepts. To reduce the context effect for direct anchor questions, Rosanna and her co-authors used an Adaptive MaxDiff procedure to obtain a rough rank-ordering of items for each respondent. Then, in real time, they asked respondents binary purchase intent questions for six items ranging along the continuum of preference from the respondent's best to the respondent's worst items. They compared results across multiple waves of data collection involving different numbers of product concepts. They found good consistency across waves and that the MaxDiff approach led to greater discrimination among the top product concepts than the ratings questions.

How Important Are the Obvious Comparisons in CBC? The Impact of Removing Easy Conjoint Tasks (Paul Johnson and Weston Hadlock, SSI): One well-known complaint about CBC questionnaires is that they can often display obvious comparisons (dominated concepts) within a choice task. Obvious comparisons are those in which the respondent recognizes that one concept is logically inferior in every way to another concept. After encountering a conjoint analysis study where a full 60% of experimentally designed choice tasks included a logically dominated concept, Paul and Weston decided to experiment on the effect of removing dominated concepts. They fielded that same study among 500 respondents, where half the sample received the typically designed CBC tasks and the other half received CBC tasks wherein the authors removed any tasks including dominated concepts, replacing them with tasks without dominated concepts (by modifying the design file in Excel). They found little difference between the two groups in terms of predictability of holdout tasks or length of time to complete the CBC questionnaire. They asked some follow-up qualitative questions regarding the survey-taking experience and found no significant differences between the two groups of respondents. Paul and Weston concluded that if it requires extra effort on the part of the researcher to modify the experimental design to avoid dominated concepts, then it probably isn't worth the extra effort in terms of quality of the results or respondent experience.

Segmenting Choice and Non-Choice Data Simultaneously (Thomas C. Eagle, Eagle Analytics of California):
This presentation focused on how to leverage both choice data (such as CBC or MaxDiff) and non-choice data (other covariates, whether nominal or continuous) to develop effective segmentation solutions. Tom compared and contrasted two common approaches: a) first estimating individual-level utility scores using HB and then using those scores plus non-choice data as basis variables within cluster analysis, or b) simultaneous utility estimation leveraging choice and non-choice data using latent class procedures, specifically Latent Gold software. Tom expressed that he worries about the two-step procedure (HB followed by clustering) for at least two reasons: first, errors in the first stage are taken as given in the second stage; and second, HB involves prior assumptions of population normality, leading to at least some degree of Bayesian smoothing to the mean—which is at odds with the notion of forming distinct segments. Using simulated data sets with known segmentation structure, Tom compared the two approaches. The two-stage approach leads to the additional complication of needing to somehow normalize the scores for each respondent to try to remove the scale confound. Also, there are issues involving HB prior settings that affect the results, and it isn't always clear to the researcher which settings to invoke. Tom also found that whether using the two-stage approach or the simultaneous one-stage approach, the BIC criterion often failed to point to the correct number of segments. He commented that if a clear segmentation exists (wide separation between groups and low response error), almost any approach will find it. But any segmentation algorithm will find patterns in data even if meaningful patterns do not exist.

Extending Cluster Ensemble Analysis via Semi-Supervised Learning (Ewa Nowakowska, GfK Custom Research North America and Joseph Retzer, CMI Research, Inc.): Ewa and Joseph's work focused on obtaining not only high quality segmentation results, but actionable ones, where actionable is defined as having particular managerial relevance (such as discriminating between intenders and non-intenders). They also reviewed the terminology of unsupervised vs. supervised learning. Unsupervised learning involves discovering latent segments in data using a series of basis variables (e.g., cluster algorithms). Supervised learning involves classifying respondents into specific target outcomes (e.g., purchasers and non-purchasers), using methods such as logistic regression, CART, neural nets, and Random Forests. Semi-supervised learning combines aspects of supervised and unsupervised learning to find segments that are of high quality (in terms of discrimination among basis variables) and actionable (in terms of classifying respondents into categories of managerial interest). Ewa and Joe's main tools for this were Random Forests (provided in R) and Sawtooth Software's CCEA (Convergent Cluster & Ensemble Analysis). The authors used the multiple solutions provided by Random Forests to compute a respondent-by-respondent similarities matrix (based on how often respondents ended up within the same terminal node). They employed hierarchical cluster analysis to develop cluster solutions based on the similarities data.
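As a rough sketch of that proximity-to-clusters step (hypothetical object names and settings; the authors' full pipeline also used CCEA for the ensemble stage), the Random Forest proximities can be clustered in R roughly as follows:

    library(randomForest)

    # 'basis' = data frame of basis variables; 'target' = factor of managerial
    # interest (e.g., intender vs. non-intender). Both are hypothetical here.
    rf <- randomForest(x = basis, y = target, ntree = 500, proximity = TRUE)

    # rf$proximity[i, j] = share of trees in which respondents i and j land in the
    # same terminal node; convert it to a distance and cluster hierarchically.
    d  <- as.dist(1 - rf$proximity)
    hc <- hclust(d, method = "ward.D2")
    supervised_segments <- cutree(hc, k = 4)   # one candidate solution for the ensemble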
These cluster solutions were combined with standard unsupervised cluster solutions (developed on the basis variables) to create ensembles of segmentation solutions, which CCEA software in turn used to create a high quality and actionable consensus cluster solution. Ewa and Joe wrapped it up by showing a web-based simulator that assigns respondents to segments based on responses to basis variables.

The Shapley Value in Marketing Research: 15 Years and Counting (Michael Conklin and Stan Lipovetsky, GfK): Michael (supported by co-author Stan) explained that the Shapley Value is not only an extension of standard TURF analysis but can also be applied to numerous other marketing research problems. The Shapley Value derives from game theory. In simplest terms, one can think about the value that a hockey player provides to a team (in terms of goals scored by the team per minute) when this player is on the ice. For marketing research TURF problems, the Shapley Value is the unique value contributed by a flavor or brand within a lineup when considering all possible lineup combinations. As one possible extension, Michael described a Shapley Value model to predict share of choice for SKUs on a shelf. Respondents indicate which SKUs are in the consideration set, and the Shapley Value is computed (across thousands of possible competitive sets) for each SKU. This value considers the likelihood that the SKU is in the consideration set and, importantly, the likelihood that the SKU is chosen within each set (equal to 1/n for each respondent, where n is the number of items in the consideration set). The benefits of this simple model of consumer behavior are that it can accommodate very large product categories (many SKUs) and it is very inexpensive to implement. The drawback is that each SKU must be a complete, fixed entity on its own (not involving varying attributes, such as prices). As yet another field of application for the Shapley Value, Michael spoke of its use in drivers analysis (rather than OLS or other related techniques). However, Michael emphasized that he thought the greatest opportunities for the Shapley Value in marketing research lie in product line optimization problems.

Demonstrating the Need and Value for a Multi-objective Product Search (Scott Ferguson and Garrett Foster, North Carolina State University): Scott and Garrett reviewed the typical steps involved in optimization problems for conjoint analysis, including estimating respondent preferences via a conjoint survey and gathering product feature costs. Usually, such optimization tasks involve setting a single goal, such as optimization of share of preference, utility, revenue, or profit. However, there is a set of solutions on an efficient frontier that represent optimal mixes of multiple goals, such as profit and share of preference. For example, two solutions may be very similar in terms of profit, but the slightly lower-profit solution may provide a large gain in terms of share of preference. A multi-objective search algorithm reports dozens or more results (product line configurations) to managers along the efficient frontier (among multiple objectives) for their consideration. Again, those near-optimal solutions represent different mixes of multiple objectives (such as profit and share of preference). Scott's application involved genetic algorithms. More than two objectives might be considered, Scott elaborated, for instance profit, share of preference, and likelihood to be purchased by a specific respondent demographic.
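To make "efficient frontier" concrete, the following minimal two-objective illustration in R (entirely made-up candidate product lines; the authors' actual search used genetic algorithms over much larger spaces) keeps only the non-dominated solutions:

    candidates <- data.frame(                  # hypothetical product-line results
      profit = c(10.0, 12.0, 11.0,  9.0, 10.0),
      share  = c(0.30, 0.22, 0.28, 0.35, 0.25)
    )

    dominated <- sapply(seq_len(nrow(candidates)), function(i) {
      any(candidates$profit >= candidates$profit[i] &
          candidates$share  >= candidates$share[i]  &
          (candidates$profit > candidates$profit[i] |
           candidates$share  > candidates$share[i]))
    })

    candidates[!dominated, ]   # the efficient frontier: rows 1-4 remain; row 5 is
                               # dominated (row 3 has both higher profit and higher share)

A manager can then weigh, for example, the highest-profit line against a slightly less profitable line that delivers substantially more share of preference.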
One of the keys to being able to explore the variety of near-optimal solutions, Scott emphasized, was the use of software visualization tools.

A Simulation Based Evaluation of the Properties of Anchored MaxDiff: Strengths, Limitations and Recommendations for Practice (Jake Lee, Maritz Research and Jeffrey P. Dotson, Brigham Young University): Jake and Jeff conducted a series of simulation studies to test the properties of three methods for anchored MaxDiff: direct binary, indirect (dual response), and the status quo method. The direct approach involves asking respondents if each item is preferred to (more important than) the anchor item (where the anchor item is typically a buy/no buy threshold or important/not important threshold). The indirect dual-response method involves asking (after each MaxDiff question) if all the items shown are important, all are not important, or some are important. The status quo approach involves adding a new item to the item list that indicates the status quo state (e.g., no change). Jake and Jeff's simulation studies examined data situations in which respondents were more or less consistent, along with whether the anchor position was at the extreme of the scale or near the middle. They concluded that under most realistic conditions, all three methods work fine. However, they recommended that the status quo method be avoided if all items are above or below the threshold. They also reported that if respondent error was especially high, the direct method should be avoided (though this usually cannot be known ahead of time).

Best-Worst CBC Conjoint Applied to School Choice: Separating Aspiration from Aversion (Angelyn Fairchild, RTI International, Namika Sagara and Joel Huber, Duke University): Most CBC research today asks respondents to select the best concept within each set. Best-Worst CBC involves asking respondents to select both the best and worst concepts within sets of at least three concepts. School choice involves both positive and negative reactions to features, so it naturally would seem a good topic for employing best-worst CBC. Joel and his co-authors fielded a study among 150 parents with children entering grades 6–11. They used a gradual, systematic introduction of the attributes to respondents. Before beginning the CBC task, they asked respondents to select the level within each attribute that best applied to their current school; then, they used warm-up tradeoff questions that showed just a few attributes at a time (partial-profile). When they compared results from best-only versus worst-only choices, they consistently found smaller utility differences between the best two levels for best-only choices. They also employed a rapid OLS-based utility estimation for on-the-fly estimation of utilities (to provide real-time feedback to respondents). Although the simple method is not expected to provide as accurate results as an HB model run on the entire dataset, the individual-level results from the OLS estimation correlated quite strongly with HB results. They concluded that if the decision to be studied involves both attraction and avoidance, then a Best-Worst CBC approach is appropriate.

Does the Analysis of MaxDiff Data Require Separate Scaling Factors? (Jack Horne and Bob Rayner, Market Strategies International): The traditional method of estimating scores for MaxDiff experiments involves combining both best and worst choices and estimating as a single multinomial logit model.
Fundamental to this analysis is the assumption that the underlying utility or preference dimension is the same whether respondents are indicating which items are best or which are worst. It also assumes that response errors for selecting bests are equivalent to errors when selecting worsts. However, empirical evidence suggests that neither the utility scale nor the error variance is the same for best and worst choices in MaxDiff. Using simulated data, Jack and his co-author Bob investigated to what degree incompatibilities in scale between best and worst choices affect the final utility scores. They adjusted the scale of one set of choices relative to the other by multiplying the design matrix for either best or worst choices by a constant, prior to estimating final utility scores. The final utilities showed the same rank order before and after the correction, though the utilities did not lie perfectly on a 45-degree line when the two sets were XY scatter plotted. Next, the authors turned to real data. They first measured the scale of bests relative to worsts and then estimated a combined model with a correction for scale differences. Correcting for scale or not resulted in essentially the same holdout hit rate under HB estimation. They concluded that although combining best and worst judgments without an error scale correction biases the utilities, the resulting rank order of the items remains unchanged and the bias is likely too small to change any business decisions. Thus, the extra work is probably not justified. As a side note, the authors suggested that comparing best-only and worst-only estimated utilities for each respondent is yet another way to identify and clean noisy respondents.

Using Conjoint Analysis to Determine the Market Value of Product Features (Greg Allenby, Ohio State University, Jeff Brazell, The Modellers, John Howell, Penn State University, and Peter Rossi, University of California Los Angeles): The main thrust of this paper was to outline a more defensible approach for using conjoint analysis to attach economic value to specific features than is commonly used in many econometric applications and in intellectual property litigation. Peter (and his co-authors) described how conjoint analysis is often used in high-profile lawsuits to assess damages. One of the most common approaches used by expert witnesses is to take the difference in utility (for each respondent) between having and not having the infringed-upon feature and divide it by the price slope. This, Peter argued, is fraught with difficulties, including a) certain respondents projected to pay astronomically high amounts for features, and b) the approach ignores important competitive realities in the marketplace. Those wanting to present evidence for high damages are prone to use conjoint analysis in this way because it is relatively inexpensive to conduct, and the difference-in-utility-divided-by-price-slope method to compute the economic value of features usually results in very large estimates for damages. Peter and his co-authors argued that assessing the economic value of a feature to a firm requires conducting market simulations (a share of preference analysis) involving a realistic set of competitors, including the outside good (the "None" category). Furthermore, it requires a game theoretic approach to compare the industry equilibrium prices with and without the alleged patent infringement.
This involves allowing each competitor to respond to the others via price changes to maximize self-interest (typically, profit).

The Ballad of Best and Worst (Tatiana Dyachenko, Rebecca Walker Naylor, and Greg Allenby, Ohio State University): Greg presented research completed primarily by the lead author, Tatiana, at our conference (unfortunately, Tatiana was unable to attend). In that work, Tatiana outlines two different perspectives regarding MaxDiff. Most current economic models for MaxDiff assume that the utilities should be invariant to the elicitation procedure. However, psychological theories would expect different elicitation modes to produce different utilities. Tatiana conducted an empirical study (regarding concerns about hair health among 594 female respondents, aged 50+) to test whether best and worst responses lead to different parameters, to investigate elicitation order effects, and to build a comprehensive model that accounted for both utility differences between best and worst answers and order effects (best answered first or worst answered first). Using her model, she and her co-authors found significant terms associated with elicitation order effects and with the difference between bests and worsts. In their data, respondents were more sure about the "worsts" than the "bests" (lower error variance around worsts). Furthermore, they found that the second decision made by respondents was less error prone. Tatiana and her co-authors recommend that researchers consider which mode of thinking is most appropriate for the business decision in question: maximizing best aspects or minimizing worst aspects. Since the utilities differ depending on the focus on bests or worsts, the two are not simply interchangeable. But if researchers decide to ask both bests and worsts, they recommend analyzing the data using a model such as theirs that can account for differences between bests and worsts and also for elicitation order effects.

9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS

CHRIS CHAPMAN [1]
GOOGLE

ABSTRACT

This paper reflects on observations from over 100 conjoint analysis projects across the industry and multiple companies that I have observed, conducted, or informed. I suggest that clients often misunderstand the results of conjoint analysis (CA) and that the many successes of CA may have created unrealistic expectations about what it can deliver in a single study. I describe some common points of misunderstanding about preference share, feature assessment, average utilities, and pricing. Then I suggest how we might make better use of distribution information from Hierarchical Bayes (HB) estimation and how we might use multiple samples and studies to inform client needs.

INTRODUCTION

Decades of results from the marketing research community demonstrate that conjoint analysis (CA) is an effective tool to inform strategic and tactical marketing decisions. CA can be used to gauge consumer interest in products and to inform estimates of feature interest, brand equity, product demand, and price sensitivity. In many well-conducted studies, analysts have demonstrated success using CA to predict market share and to determine strategic product line needs. [2]

However, the successes of CA also raise clients' expectations to levels that can be excessively optimistic. CA is widely taught in MBA courses, and a new marketer in industry is likely soon to encounter CA success stories and business questions where CA seems appropriate. This is great news . . . if CA is practiced appropriately.
The apparent ease of designing, fielding, and analyzing a CA study presents many opportunities for analysts and clients to make mistakes. In this paper, I describe some misunderstandings that I've observed in conducting and consulting on more than 100 CA projects. Some of these come from projects I've fielded while others have been observed in consultation with others; none is exemplary of any particular firm. Rather, the set of cases reflects my observations of the field. For each one I describe the problem and how I suggest rectifying it in clients' understanding.

All data presented here are fictional. The data primarily concern an imaginary "designer USB drive" that comprises nominal attributes such as size (e.g., Nano, Full-length) and design style, ordinal attributes of capacity (e.g., 32 GB), and price. The data were derived by designing a choice-based conjoint analysis survey, having simulated respondents make choices, and estimating the utilities using Hierarchical Bayes multinomial logit estimation. For full details, refer to the source of the data: simulation and example code given in the R code "Rcbc" (Chapman, Alford, and Ellis, 2013; available from this author). The data here were not designed to illustrate problems; rather, they come from didactic R code. It just happens that those data—like data in most CA projects—are misinterpretable in all the common ways.

_______________
[1] [email protected]
[2] There are too many published successes for CA to list them comprehensively. For a start, see papers in this and other volumes of the Proceedings of the Sawtooth Software Conference. Published cases where this author contributed used CA to inform strategic analysis using game theory (Chapman & Love, 2012), to search for optimum product portfolios (Chapman & Alford, 2010), and to predict market share (Chapman, Alford, Johnson, Lahav, & Weidemann, 2009). This author also helped compile evidence of CA reliability and validity (Chapman, Alford, & Love, 2009).

MISTAKE #1: CONJOINT ANALYSIS DIRECTLY TELLS US HOW MANY PEOPLE WILL BUY THIS PRODUCT

A simple client misunderstanding is that CA directly estimates how many consumers will purchase a product. It is simple to use part-worth utilities to estimate preference share and interpret this as "market share." Table 1 demonstrates this using the multinomial logit formula for aggregate share between two products. In practice, one might use individual-level utilities in a market simulator such as Sawtooth Software SMRT, but the result is conceptually the same.

Table 1: Example Preference Share Calculation

                     Product 1    Product 2    Total
Sum of utilities        1.0          0.5         --
Exponentiated           2.72         1.65       4.37
Share of total          62%          38%

As most research practitioners know but many clients don't (or forget), the problem is this: preference share is only partially indicative of real market results. Preference share is an important input to a marketing model, yet is only one input among many. Analysts and clients need to determine that the CA model is complete and appropriate (i.e., valid for the market) and that other influences are modeled, such as awareness, promotion, channel effects, competitive response, and perhaps most importantly, the impact of the outside good (in other words, that customers could choose none of the above and spend money elsewhere).

I suspect this misunderstanding arises from three sources. First, clients very much want CA to predict share!
Second, CA is often given credit for predicting market share even when CA was in fact just one part of a more complex model that mapped CA preference to the market. Third, analysts' standard practice is to talk about "market simulation" instead of "relative preference simulation."

Instead of claiming to predict market share, I tell clients this: conjoint analysis assesses how many respondents prefer each product, relative to the tested alternatives. If we iterate studies, know that we're assessing the right things, calibrate to the market, and include other effects, we will get progressively better estimates of the likely market response. CA is a fundamental part of that, yet only one part. Yes, we can predict market share (sometimes)! But an isolated, single-shot CA is not likely to do so very well.

MISTAKE #2: CA ASSESSES HOW GOOD OR BAD A FEATURE (OR PRODUCT) IS

The second misunderstanding is similar to the first: clients often believe that the highest part-worth indicates a good feature while negative part-worths indicate bad ones. Of course, all utilities really tell us is that, given the set of features and levels presented, this is the best fit to a set of observed choices. Utilities don't indicate absolute worth; inclusion of different levels likely would change the utilities.

A related issue is that part-worths are relative within a single attribute. We can compare levels of an attribute to one another—for instance, to say that one memory size is preferable to another memory size—but should not directly compare the utilities of levels across attributes (for instance, to say that some memory size level is more or less preferred than some level of color or brand or processor). Ultimately, product preference involves full specification across multiple attributes and is tested in a market simulator (I say more about that below).

I tell clients this: CA assesses tradeoffs among features to be more or less preferred. It does not assess absolute worth or say anything about untested features.

MISTAKE #3: CA DIRECTLY TELLS US WHERE TO SET PRICES

Clients and analysts commonly select CA as a way to assess pricing. What is the right price? How will such-and-such feature affect price? How price sensitive is our audience? All too often, I've seen clients inspect the average part-worths for price—often estimated without constraints and as piecewise utilities—and interpret them at face value.

Figure 1 shows three common patterns in price utilities; the dashed line shows scaling in exact inverse proportion to price, while the solid line plots the preference that we might observe from CA (assuming a linear function for patterns A and B, and a piecewise estimation in pattern C, although A and B could just as well be piecewise functions that are monotonically decreasing). In pattern A, estimated preference share declines more slowly than price (or log price) increases. Clients love this: the implication is to price at the maximum (presumably not to infinity). Unfortunately, real markets rarely work that way; this pattern more likely reflects a method effect where CA underestimates price elasticity.

Figure 1: Common Patterns in Price Utilities (A: Inelastic demand; B: Elastic demand; C: Curved demand)

In pattern B, the implication is to price at the minimum. The problem here is that relative preference implies range dependency.
This may simply reflect the price range tested, or reflect that respondents are using the survey for communication purposes ("price low!") rather than to express product preferences.

Pattern C seems to say that some respondents like low prices while others prefer high prices. Clients love this, too! They often ask, "How do we reach the price-insensitive customers?" The problem is that there is no good theory as to why price should show such an effect. It is more likely that the CA task was poorly designed or confusing, or that respondents had different goals such as picking their favorite brand or heuristically simplifying the task in order to complete it quickly. Observation of a price reversal as we see here (i.e., preference going up as price goes up in some part of the curve) is more likely an indication of a problem than an observation about actual respondent preference! If pattern C truly does reflect a mixture of populations (elastic and inelastic respondents) then there are higher-order questions about the sample validity and the appropriateness of using pooled data to estimate a single model. In short: pattern C is seductive! Don't believe it unless you have assessed it carefully, ruled out the confounds, and considered the more theoretically sound constrained (declining) price utilities.

What I tell clients about price is: CA provides insight into stated price sensitivity, not exact price points or demand estimates without a lot more work and careful consideration of models, potentially including assessments that attempt more realistic incentives, such as incentive-aligned conjoint analysis (Ding, 2007). When assessing price, it's advantageous to use multiple methods and/or studies to confirm that answers are consistent.

MISTAKE #4: THE AVERAGE UTILITY IS THE BEST MEASURE OF INTEREST

I often see—and yes, sometimes even produce—client deliverables with tables or charts of "average utilities" by level. This unfortunately reinforces a common cognitive error: that the average is the best estimate. Mathematically, of course, the mean of a distribution minimizes some kinds of residuals—but that is rarely how a client interprets an average!

Consider Table 2. Clients interpret this as saying that Black is a much better feature than Tie-dye. Sophisticated ones might ask whether it is statistically significant ("yes") or compute the preference share for Black (84%). None of that answers the real question: which is better for the decision at hand?

Table 2: Average Feature Utilities

Feature      Average Utility
Black              0.79
Tie-dye           -0.85
...                 ...

Figure 3 is what I prefer to show clients and presents a very different picture. In examining Black vs. Tie-dye, we see that the individual-level estimates for Black have low variance while Tie-dye has high variance. Black is broadly acceptable, relative to other choices, while Tie-dye is polarizing. Is one better? That depends on the goal. If we can only make a single product, we might choose Black. If we want a diverse portfolio with differently appealing products, Tie-dye might fit. If we have a way to reach respondents directly, then Silver might be appealing because a few people strongly prefer it. Ultimately this decision should be made on the basis of market simulation (more on that below), yet understanding the preference structure more fully may help an analyst understand the market and generate hypotheses that otherwise might be overlooked.
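The figure itself is not reproduced here, but the kind of display Figure 3 shows is easy to generate from HB output. The sketch below uses simulated values that merely mimic the fictional Black/Tie-dye pattern in Table 2 (hypothetical data, not the study's estimates):

    set.seed(1)
    beta <- cbind(                     # pretend individual-level HB point estimates
      Black   = rnorm(500, mean =  0.79, sd = 0.4),   # broadly acceptable, low variance
      Tie_dye = rnorm(500, mean = -0.85, sd = 1.6)    # polarizing, high variance
    )

    colMeans(beta)                     # the "average utility" a client usually sees
    apply(beta, 2, sd)                 # the spread that the average hides
    apply(beta, 2, quantile, probs = c(0.10, 0.50, 0.90))
    boxplot(as.data.frame(beta), horizontal = TRUE,
            main = "Individual-level part-worths, not just the mean")

Two levels with roughly mirror-image means can imply very different product decisions once the spread is visible.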
Figure 3: Distribution of Individual-Level Utilities from HB Estimation

The client takeaway is this: CA (using HB) gives us a lot more information than just average utility. We should use that information to have a much better understanding of the distribution of preference.

MISTAKE #5: THERE IS A TRUE SCORE

The issue about average utility (problem #4 above) also arises at the individual level. Consider Figure 4, which presents the mean betas for one respondent. This respondent has low utilities for features 6 and 10 (on the X axis) and high utilities for features 2, 5, and 9. It is appealing to think that we have a psychic X-ray of this respondent, that there is some "true score" underlying these preferences, as a social scientist might say.

There are several problems with this view. One is that behavior is contextually dependent, so any respondent might very well behave differently at another time or in another context (such as a store instead of a survey). Yet even within the context of a CA study, there is another issue: we know much more about the respondent than the average utility!

Figure 4: Average Utility by Feature, for One Respondent

Now compare Figure 5 with Figure 4. Figure 5 shows—for the same respondent—the within-respondent distribution of utility estimates across 100 draws of HB estimates (using Markov chain Monte Carlo, or MCMC, estimation). We see significant heterogeneity. An 80% or 95% credible interval on the estimates would find few "significant" differences for this respondent. This is a more robust picture of the respondent, and inclines us away from thinking of him or her as a "type."

Figure 5: Distribution of HB Beta Estimates by Feature, for the Same Respondent

What I tell clients is this: understand respondents in terms of tendency rather than type. Customers behave differently in different contexts and there is uncertainty in CA assessment. The significance of that fact depends on our decisions, business goals, and ability to reach customers.

MISTAKE #6: CA TELLS US THE BEST PRODUCT TO MAKE (RATHER EASILY)

Some clients and analysts realize that CA can be used not only to assess preference share and price sensitivity but also to inform a product portfolio. In other words, to answer "What should we make?" An almost certainly wrong answer would be to make the product with highest utility, because it is unlikely that the most desirable features would be paired with the best brand and lowest price. A more sophisticated answer searches for preference tradeoff vs. cost in the context of a competitive set. However, this method capitalizes on error and precise specification of the competitive sets; it does not examine the sensitivity and generality of the result. Better results may come by searching for a large set of near-optimum products and examining their commonalities (Chapman and Alford, 2010; cf. Belloni et al., 2008). Another approach, depending on the business question, would be to examine likely competitive response to a decision using a strategic modeling approach (Chapman and Love, 2012). An analyst could combine the approaches: investigate a set of many potential near-optimal products, choose a set of products that is feasible, and then investigate how competition might respond to that line.
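As a rough illustration of the near-optimum search idea—not the genetic-algorithm machinery of Chapman and Alford (2010), just a toy enumeration under invented part-worths, costs, and a single fixed competitor—the sketch below scores candidate products by share-of-preference-weighted margin and keeps the top few so their commonalities can be inspected.

    import itertools
    import numpy as np

    # Hypothetical aggregate part-worths (real work would use individual-level HB draws).
    partworths = {
        "brand":  {"Ours": 0.4, "Competitor": 0.6},
        "memory": {"16GB": -0.5, "32GB": 0.1, "64GB": 0.4},
        "color":  {"Black": 0.3, "Tie-dye": -0.3},
    }
    price_coef = -0.004                                   # assumed utility per dollar
    unit_cost = {"16GB": 180, "32GB": 200, "64GB": 230}   # illustrative costs

    competitor = {"brand": "Competitor", "memory": "32GB", "color": "Black", "price": 299}

    def utility(prod):
        return sum(partworths[a][prod[a]] for a in partworths) + price_coef * prod["price"]

    def share(prod):
        # Logit share of preference against the single fixed competitor.
        u = np.array([utility(prod), utility(competitor)])
        e = np.exp(u - u.max())
        return e[0] / e.sum()

    candidates = []
    for mem, col, price in itertools.product(partworths["memory"], partworths["color"], [249, 299, 349]):
        prod = {"brand": "Ours", "memory": mem, "color": col, "price": price}
        expected_margin = share(prod) * (price - unit_cost[mem])
        candidates.append((expected_margin, prod))

    for margin, prod in sorted(candidates, key=lambda c: -c[0])[:5]:
        print(round(margin, 1), prod)

Examining what the near-optimal set has in common (for example, whether one memory size appears in every surviving candidate) is usually more robust than betting on the single top-scoring configuration.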
Doing this kind of combined analysis is a complex process: it requires extraordinarily high confidence in one's data, and then one must address crucial model assumptions and adapt (or develop) custom code in R or some other language to estimate the models (Chapman and Alford, 2010; Chapman and Love, 2012). The results can be extremely informative—for instance, a product identified in Chapman and Alford (2010) was identified by the model fully 17 months in advance of its introduction to the market by a competitor—but arriving at such an outcome is a complex undertaking built on impeccable data (and perhaps luck).

In short, when clients wish to find the "best product," I explain: CA informs us about our line, but precise optimization requires more models, data, and expertise.

MISTAKE #7: GET AS MUCH STATISTICAL POWER (SAMPLE) AS POSSIBLE

This issue is not specific to CA but to research in general. Too many clients (and analysts) are impressed with sample size and automatically assume that more sample is better. Figure 6 shows the schematic of a choice-based conjoint analysis (CBC) study I once observed. The analyst had a complex model with limited sample and wanted to obtain adequate power. Each CBC task presented 3 products and a None option . . . and respondents were asked to complete 60 such tasks!

Figure 6: A Conjoint Analysis Study with Great "Power"

Power is directly related to confidence intervals, and the problem with confidence intervals (in classical statistics) is that they scale to the inverse square root of sample size. When you double the sample size, you only reduce the confidence interval by about 30% (1 − 1/√2 ≈ 0.29). To cut the confidence interval in half requires 4x the sample size. This has two problems: diminishing returns, and lack of robustness to sample misspecification. If your sample is a non-probability sample, as most are, then sampling more of it may not be the best approach.

I prefer instead to approach sample size this way: determine the minimum sample needed to give an adequate business answer, and then split the available sampling resources into multiple chunks of that size, assessing each one with varying methods and/or sampling techniques. We can have much higher confidence when findings come from multiple samples using multiple methods.

What I tell clients: instead of worrying about more and more statistical significance, we should maximize interpretative power and minimize risk. I sketch what such multiple assessments might look like. "Would you rather have: (1) Study A with N=10000, or (2) Study A with 1200, Study B with 300, Study C with 200, and Study D with 800?" Good clients understand immediately that despite having ¼ the sample, Plan 2 may be much more informative!

MISTAKE #8: MAKE CA FIT WHAT YOU WANT TO KNOW

To address tough business questions, it's a good idea to collect customer data with a method like CA. Unfortunately, this may yield surveys that are more meaningful to the client than the respondent. I find this often occurs with complex technical features (that customers may not understand) and messaging statements (that may not influence CA survey behavior).

Figure 7 presents a fictional CBC task about wine preferences. It was inspired by a poorly designed survey I once took about home improvement products; I selected wine as the example because it makes the issue particularly obvious.

Figure 7: A CBC about Wine

Imagine you are selecting a bottle of wine for a special celebration dinner at home.
If the following wines were your only available choices, which would you purchase?

  Blend:                   75% Cabernet Sauvignon, 20% Merlot,     75% Cabernet Sauvignon, 15% Merlot,
                           4% Cabernet Franc, 1% Malbec            10% Cabernet Franc
  Winery type:             Custom crush                            Negotiant
  Bottle size:             700ml                                   750ml
  Cork type:               Grade 2                                 Double disk (1+1)
  Fining agent:            (None, unfined)                         Potassium caseinate
  Bottling line type:      Mobile                                  On premises
  Origin of bottle glass:  Mexico                                  China

Our fictional marketing manager is hoping to answer questions like these: should we fine our wines (cause them to precipitate sediment before bottling)? Can we consider cheaper bottle sources? Should we invest in an in-house bottling line (instead of a truck that moves between facilities)? Can we increase the Cabernet Franc in our blend (for various possible reasons)? And so forth.

Those are all important questions but posing their technical features to customers results in a survey that only a winemaker could answer! A better survey would map the business consideration to features that a consumer can address, such as taste, appearance, aging potential, cost, and critics' scores. (I leave the question of how to design that survey about wine as an exercise for the reader.)

This example is extreme, yet how often do we commit similar mistakes in areas where we are too close to the business? How often do we test something "just to see if it has an effect?" How often do we describe something the way that R&D wants? Or include a message that has little if any real information? And then, when we see a null effect, are we sure that it is because customers don't care, or could it be because the task was bad? (A similar question may be asked in case of significant effects.) And, perhaps most dangerously, how often do we field a CA without doing a small-sample pretest?

The implication is obvious: design CA tasks to match what respondents can answer reliably and validly. And before fielding, pretest the attributes, levels, and tasks to make sure!

(NON!-) MISTAKE #9: IT'S BETTER THAN USING OUR INSTINCTS

Clients, stakeholders, managers, and sometimes even analysts are known to say, "Those results are interesting but I just don't believe them!" Then an opinion is substituted for the data. Of course CA is not perfect—all of the above points demonstrate ways in which it may go wrong, and there are many more—but I would wager this: a well-designed, well-fielded CA is almost always better than expert opinion. Opinions of those close to a product are often dramatically incorrect (cf. Gourville, 2004). Unless you have better and more reliable data that contradicts a CA, go with the CA.

If we consider this question in terms of expected payoff, I propose that the situation resembles Figure 8. If we use data, our estimates are likely to be closer to the truth than if we don't. Sometimes they will be wrong, but will not be as wrong on average as opinion would be.

Figure 8: Expected Payoffs with and without Data

                        Use data                       Use instinct
  Decision correct      High precision (high gain)     Low precision (modest gain)
  Decision incorrect    Low inaccuracy (modest loss)   High inaccuracy (large loss)
  Net expectation       Positive                       Negative

When we get a decision right with data, the relative payoff is much larger. Opinion is sometimes right, but likely to be imprecise; when it is wrong, expert opinion may be disastrously wrong. On the other hand, I have yet to observe a case where consumer data has been terribly misleading; the worst case I've seen is when it signals a need to learn more.
When opinion and data disagree, explore more. Do a different study, with a different method and different sampling. What I tell clients: it's very risky to bet against what your customers are telling you! An occasional success—or an excessively successful single opiner—does not disprove the value of data.

MISTAKE #10 AND COUNTING

Keith Chrzan (2013) commented on this paper after presentation at the Sawtooth Software Conference and noted that attribute importance is another area where there is widespread confusion. Clients often want to know "Which attributes are most important?" but CA can only answer this with regard to the relative utilities of the attributes and features tested. Including (or omitting) a very popular or unpopular level on one attribute will alter the "importance" of every other attribute!

CONCLUSION

Conjoint analysis is a powerful tool but its power and success also create conditions where client expectations may be too high. We've seen that some of the simplest ways to view CA results such as average utilities may be misleading, and that despite client enthusiasm they may distract from answering more precise business questions. The best way to meet high expectations is to meet them! This may require all of us to be more careful in our communications, analyses, and presentations. The issues here are not principally technical in nature; rather they are about how conjoint analysis is positioned and how expectations are set and upheld through effective study design, analysis, and interpretation. I hope the paper inspires you—and even better, inspires and informs clients.

Chris Chapman

ACKNOWLEDGEMENTS

I'd like to thank Bryan Orme, who provided careful, thoughtful, and very helpful feedback at several points to improve both this paper and the conference presentation. If this paper is useful to the reader, that is in large part due to Bryan's suggestions (and if it's not useful, that's due to the author!) Keith Chrzan also provided thoughtful observation and reflections during the conference. Finally, I'd like to thank all my colleagues over the years, many of whom are reflected in the reference list. They spurred the reflections more than anything I did.

REFERENCES

Belloni, A., Freund, R.M., Selove, M., and Simester, D. (2008). Optimal product line design: efficient methods and comparisons. Management Science 54:9, September 2008, pp. 1544–1552.

Chapman, C.N., Alford, J.L., and Ellis, S. (2013). Rcbc: marketing research tools for choice-based conjoint analysis, version 0.201. [R code]

Chapman, C.N., and Love, E. (2012). Game theory and conjoint analysis: using choice data for strategic decisions. Proceedings of the 2012 Sawtooth Software Conference, Orlando, FL, March 2012.

Chapman, C.N., and Alford, J.L. (2010). Product portfolio evaluation using choice modeling and genetic algorithms. Proceedings of the 2010 Sawtooth Software Conference, Newport Beach, CA, October 2010.

Chapman, C.N., Alford, J.L., Johnson, C., Lahav, M., and Weidemann, R. (2009). Comparing results of CBC and ACBC with real product selection. Proceedings of the 2009 Sawtooth Software Conference, Del Ray Beach, FL, March 2009.

Chapman, C.N., Alford, J.L., and Love, E. (2009). Exploring the reliability and validity of conjoint analysis studies. Presented at Advanced Research Techniques Forum (A/R/T Forum), Whistler, BC, June 2009.

Chrzan, K. (2013). Remarks on "9 things clients get wrong about conjoint analysis." Discussion at the 2013 Sawtooth Software Conference, Dana Point, CA, October 2013.

Ding, M. (2007).
An incentive-aligned mechanism for conjoint analysis. Journal of Marketing Research, 2007, pp. 214–223.

Gourville, J. (2004). Why customers don't buy: the psychology of new product adoption. Case study series, paper 9-504-056. Harvard Business School, Boston, MA.

QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A TRADITIONAL MANUFACTURING FIRM: UPDATE AND CASE STUDY

ROBERT J. GOODWIN
LIFETIME PRODUCTS, INC.

ABSTRACT

Lifetime Products, Inc., a manufacturer of folding furniture and other consumer hard goods, provides a progress report on its quest for more effective analytic methods and offers an insightful new ACBC case study. This demonstration of a typical adaptive choice study, enhanced by an experiment with conjoint analysis design parameters, is intended to be of interest to new practitioners and experienced users alike.

INTRODUCTION

Lifetime Products, Inc. is a privately held, vertically integrated manufacturing company headquartered in Clearfield, Utah. The company manufactures consumer hard goods typically constructed of blow-molded polyethylene resin and powder-coated steel. Its products are sold to consumers and businesses worldwide, primarily through a wide range of discount and department stores, home improvement centers, warehouse clubs, sporting goods stores, and other retail and online outlets.

Over the past seven years, the Lifetime Marketing Research Department has adopted progressively more sophisticated conjoint analysis and other quantitative marketing research tools to better inform product development and marketing decision-making. The company's experiences in adopting and cost-effectively utilizing these sophisticated analytic methods—culminating in its current use of Sawtooth Software's Adaptive Choice-Based Conjoint (ACBC) software—were documented in papers presented at previous Sawtooth Software Conferences (Goodwin 2009, and Goodwin 2010).

In this paper, we first provide an update on what Lifetime Products has learned about conjoint analysis and potential best practices thereof over the past three years. Then, for demonstration purposes, we present a new Adaptive CBC case on outdoor storage sheds. The paper concludes with a discussion of our experimentation with ACBC design parameters in this shed study.

I. WHAT WE'VE LEARNED ABOUT CONJOINT ANALYSIS

This section provides some practical advice, intended primarily for new and corporate practitioners of conjoint analysis and other quantitative marketing tools. This is based on our experience at Lifetime Products as a "formerly new" corporate practitioner of conjoint analysis.

#1. Use Prudence in Conjoint Analysis Design

One of the things that helped drive our adoption of Sawtooth Software's Adaptive Choice-Based Conjoint program was its ability to administer conjoint analysis designs with large numbers of attributes, without overburdening respondents. The Concept Screening phase of the ACBC protocol allows each panelist to create a "short list" of potentially acceptable concepts using whatever decision-simplification techniques s/he wishes, electing to downplay or even ignore attributes they considered less essential to the purchase decision. Further, we could allow them to select a subset of the most important attributes for inclusion (or, alternatively, the least important attributes for exclusion) for the rest of the conjoint experiment. Figure 1 shows an example page from our first ACBC study on Storage Sheds in 2008.
Note that, in the responses entered, the respondent has selected eight attributes to include—along with price and materials of construction, which were crucial elements in the experiment—while implicitly excluding the other six attributes from further consideration. Constructed lists could then be used to bring forward only the Top-10 attributes from an original pool of 16 attributes, making the exercise much more manageable for the respondent. While the part-worths for an excluded attribute would be zero for that observation, we would still capture the relevant utility of that attribute for another panelist who retained it for further consideration in the purchase decision.

Figure 1: Example of Large-scale ACBC Design (Storage Sheds)

We utilized this "winnowing" feature of ACBC for several other complex-design studies in the year or two following its adoption at Lifetime. During the presentation of those studies to our internal clients, we noticed a few interesting behaviors. One was the virtual fixation of a few clients on a "pet" feature that (to their dismay) registered minimal decisional importance following Hierarchical Bayes (HB) estimation. While paying very little attention to the most important attributes in the experiment, they spent considerable time trying to modify the attribute to improve its role in the purchase decision. In essence, this diverted their attention from what mattered most in the consumers' decision to what mattered least.

The more common client behavior was what could be called a "reality-check" effect. Once the clients realized (and accepted) the minimal impact of such an attribute on purchase decisions, they immediately began to concentrate on the more important array of attributes. Therefore, when it came time to do another (similar) conjoint study, they were less eager to load up the design with every conceivable attribute that might affect purchase likelihood.

Since that time, we have tended not to load up a conjoint study with large numbers of attributes and levels, just because "it's possible." Instead, we have sought designs that are more parsimonious by eliminating attributes and levels that we already know to be less important in consumers' decision-making. As a result, most of our recent studies have gravitated around designs of 8–10 attributes and 20–30 levels.

Occasionally, we have found it useful to assess less-important attributes—or those that might be more difficult to measure in a conjoint instrument—by testing them in regular questions following the end of the conjoint experiment. For example, Figure 2 shows a follow-up question in our 2013 Shed conjoint analysis survey to gauge consumers' preference for a shed that emphasized ease of assembly (at the expense of strength) vis-à-vis a shed that emphasized strength (at the expense of longer assembly times). (This issue is relevant to Lifetime Products, since our sheds have relatively large quantities of screws—making for longer assembly times—but are stronger than most competitors' sheds.)

Figure 2: Example of Post-Conjoint Preference Question (Storage Sheds)

#2. Spend Time to Refine Conjoint Analysis Instruments

Given the importance of the respondent being able to understand product features and attributes, we have found it useful to spend extra time on the front end to ensure that the survey instrument and conjoint analysis design will yield high-quality results. In a previous paper (Goodwin, 2009), we reported the value of involving clients in instrument testing and debugging.
In a more general sense, we continue to review our conjoint analysis instruments and designs with multiple iterations of client critique and feedback. As we do so, we look out for several potential issues that could degrade the quality of conjoint analysis results.

First, we wordsmith attribute and level descriptions to maximize clarity. For example, with some of our categories, we have found a general lack of understanding in the marketplace regarding some attributes (such as basketball height-adjustment mechanisms and backboard materials; shed wall, roof and floor materials; etc.). Attributes such as these necessitate great care to employ verbiage that is understandable to consumers.

Another area we look out for involves close substitutes among levels of a given attribute, where differences might be difficult for consumers to perceive, even in the actual retail environment. For example, most mid-range basketball goals have backboard widths between 48 and 54 inches, in 2-inch increments. While most consumers can differentiate well between backboards at opposite ends of this range, they frequently have difficulty deciding—or even differentiating—among backboard sizes with 2-inch size differences. Recent qualitative research with basketball system owners has shown that, even while looking at 50-inch and 52-inch models side-by-side in a store, it is sometimes difficult for them (without looking at the product labeling) to tell which one is larger than the other. While our effort is not to force product discrimination in a survey where it may not exist that strongly in the marketplace itself, we want to ensure that panelists are given a realistic set of options to choose from (i.e., so the survey instrument is not the "problem"). Frequently, this means adding pictures or large labels showing product size or feature call-outs to mimic in-store shopping as much as possible.

#3. More Judicious with Brand Names

Lifetime Products is not a household name like Coke, Ford, Apple, and McDonald's. As a brand sold primarily through big-box retailers, Lifetime Products is well known among the category buyers who put our product on the shelf, but less so among consumers who take it off the shelf. In many of our categories (such as tables & chairs, basketball, and sheds), the assortment of brands in a given store is limited. Consequently, consumers tend to trust the retailer to be the brand "gatekeeper" and to carry only the best and most reliable brands. In doing so, they often rely less on their own brand perceptions and experiences.

Lifetime Products' brand image is also confounded by misconceptions regarding other entities such as the Lifetime Movie Channel, Lifetime Fitness Equipment, and even "lifetime" warranty. There are also perceptual anomalies among competitor brands. For example, Samsonite (folding tables) gets a boost from their well-known luggage line, Cosco (folding chairs) is sometimes mistaken for the Costco store brand, and Rubbermaid (storage sheds) has a halo effect from the wide array of Rubbermaid household products. Further, Lifetime kayaks participate in a market that is highly fragmented with more than two dozen small brands, few of which have significant brand awareness.

As a result, many conjoint analysis studies we have done produce flat brand utility profiles, accompanied by low average-attribute-importance scores. This is especially the case when we include large numbers of brand names in the exercise.
Many of these brands end up with utility scores lower than the "no brand" option, despite being well regarded by retail chain store buyers. Because of these somewhat-unique circumstances in our business, Lifetime often uses heavily abridged brand lists in its conjoint studies, or in some cases drops the brand attribute altogether. In addition, in our most recent kayak industry study (with its plethora of unknown brands), we had to resort to surrogate descriptions such as "a brand I do not know," "a brand I know to be good," and so forth.

#4. Use Simulations to Estimate Brand Equity

Despite the foregoing, there are exceptions (most notably in the Tables & Chairs category) where the Lifetime brand is relatively well known and has a long sales history among a few key retailers. In this case, our brand conjoint results are more realistic, and we often find good perceptual differentiation among key brand names, including Lifetime.

Lifetime sales managers often experience price resistance from retail buyers, particularly in the face of new, lower-price competition from virtually unknown brands (in essence, "no brand"). In instances like these, it is often beneficial to arm these sales managers with statistical evidence of the value of the Lifetime brand as part of its overall product offering. Recently, we generated such a brand equity analysis for folding utility tables using a reliable conjoint study conducted a few years ago. In this context, we defined "per-unit brand equity" as: the price reduction a "no-name" brand would have to use in order to replace the Lifetime brand and maintain Lifetime's market penetration.

The procedure we used for this brand equity estimation was as follows:

1. Generate a standard share of preference simulation, with the Lifetime table at its manufacturer's suggested retail price (MSRP), two competitor offerings at their respective MSRPs, and the "None" option. (See left pie chart in Figure 3.)

2. Re-run the simulation using "no brand name" in place of the Lifetime brand, with no other changes in product specifications (i.e., an exact duplicate offering except for the brand name). The resulting share of preference for the "no-name" offering (which otherwise duplicated the Lifetime attributes) decreased from the base-case share. (Note that much of that preference degradation went to the "None" option, not to the other competitors, suggesting possible strength of the Lifetime brand over the existing competitors as well.)

3. Gradually decrease the price of the "no-name" offering until its share of preference matched the original base case for the Lifetime offering. In this case, the price differential was about -6%, which represents a reasonable estimate of the value of the Lifetime brand, ceteris paribus. In other words, a no-name competitor with the same specification as the Lifetime table would have to reduce its price 6% in order to maintain the same share of preference as the Lifetime table. (See right pie chart in Figure 3.)

Figure 3: Method to Estimate Lifetime Brand Value (shares of preference with the Lifetime table at retail price vs. a "no-name" table at a 6% lower price, against Brand X, Brand Y, and "would not buy any of these")
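The mechanics of step 3 amount to a simple search inside the market simulator. The sketch below (Python, with invented aggregate utilities and an assumed linear price term) shows the idea; the real analysis ran the search over individual-level HB utilities in the simulator with Lifetime's actual competitive set, and the 6% result above is specific to that study, not to these toy numbers.

    import numpy as np

    # Hypothetical aggregate utilities for a three-product-plus-None market
    # (illustrative values only; the real analysis used individual-level HB utilities).
    brand_u = {"Lifetime": 0.50, "Brand X": 0.10, "Brand Y": -0.10, "No name": 0.35}
    msrp = {"Lifetime": 100.0, "Brand X": 95.0, "Brand Y": 90.0}
    none_u = 0.0

    def price_utility(p):
        return -0.025 * (p - 100)          # assumed linear price utility per dollar

    def shares(offerings):
        # Logit share-of-preference simulation including the None alternative.
        u = np.array([brand_u[b] + price_utility(p) for b, p in offerings] + [none_u])
        e = np.exp(u - u.max())
        return e / e.sum()

    competitors = [("Brand X", msrp["Brand X"]), ("Brand Y", msrp["Brand Y"])]
    target = shares([("Lifetime", msrp["Lifetime"])] + competitors)[0]   # baseline Lifetime share

    # Step 3: lower the no-name price until its share matches the Lifetime baseline.
    price = msrp["Lifetime"]
    while shares([("No name", price)] + competitors)[0] < target:
        price -= 0.25

    cut = 100 * (msrp["Lifetime"] - price) / msrp["Lifetime"]
    print(f"No-name price cut needed to match Lifetime's baseline share: about {cut:.0f}%")

With individual-level utilities, the same loop simply averages the simulated shares across respondents at each trial price before comparing to the baseline.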
The 6% brand value may not seem high, compared with the perceived value of more well-known consumer brands. Nevertheless, in the case of Lifetime tables, this information was very helpful for our sales managers responding to queries from retail accounts to help justify higher wholesale prices than the competition.

#5. Improved Our Simulation Techniques

Over the past half-dozen years of using conjoint analysis, Lifetime has improved its use of simulation techniques to help inform management decisions and sales approaches. In these simulations, we generally have found it most useful to use the "None" option in order to capture buy/no-buy behavior and to place relatively greater emphasis on share of preference among likely buyers. Most importantly, this approach allows us to measure the possible expansion (or contraction) of the market due to the introduction of a new product (or the deletion of an existing product). We have found this approach particularly useful when simulating the behavior of likely customers of our key retail partners and the change in their "market" size.

Recently, we conducted several simulation analyses to test the impact of pricing strategies for our retail partners. We offer two of them here. In both cases, the procedure was to generate a baseline simulation, not only of shares of preference (i.e., number of units), but also of revenue and (where available) retail margin. We then conducted experimental "what-if" simulations to compare with the baseline scenario. Because both situations involved multiple products—and the potential for cross-cannibalization—we measured performance for the entire product line at the retailer.

The first example involved a lineup of folding tables at a relatively large retail account (see Figure 4). In question was the price of a key table model in that lineup, identified as Table Q in the graphic. The lines in the graphic represent changes in overall table lineup units, revenue, and retail margin, indexed against the current price-point scenario for Table Q (Index = 1.00). We ran a number of experimental simulations based on adjustments to the Table Q price point and ran the share of preference changes through pricing and margin calculations for the entire lineup.

Figure 4: Using Simulations to Measure Unit, Revenue & Margin Changes

As might be expected, decreasing the price of Table Q (holding all other prices and options constant) would result in moderate increases in overall numbers of units sold (solid line), smaller increases in revenue (due to the lower weighted-average price with the Table Q price cut; dashed line), and decreases in retail margin (dotted line). Note that these margin decreases would rapidly become severe, since the absolute value of a price decrease is applied to a much smaller margin base. (See curve configurations to the left of the crossover point in Figure 4.) On the other hand, if the price of Table Q were to be increased, the effects would go in the opposite direction in each case: margin would increase, and revenue and units would decrease. (See curve configurations to the right of the crossover point in Figure 4.)
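The bookkeeping behind this kind of what-if chart is straightforward once the simulator produces shares: re-run the lineup at each candidate Table Q price and index units, revenue, and margin against the current scenario. A minimal sketch follows, with invented utilities, prices, and costs standing in for the study's HB utilities and the retailer's actual economics.

    import numpy as np

    # Hypothetical lineup (illustrative values only).
    base_u = {"Table Q": 1.0, "Table R": 0.6, "Table S": 0.3}      # non-price utility
    price = {"Table Q": 49.0, "Table R": 39.0, "Table S": 59.0}
    cost = {"Table Q": 34.0, "Table R": 29.0, "Table S": 41.0}     # retailer cost
    none_u, slope = 0.0, -0.04                                     # assumed price slope

    def lineup_metrics(q_price):
        p = dict(price, **{"Table Q": q_price})
        u = np.array([base_u[k] + slope * p[k] for k in base_u] + [none_u])
        share = np.exp(u - u.max()); share /= share.sum()
        units = dict(zip(base_u, share))                           # the None share is dropped
        revenue = sum(units[k] * p[k] for k in base_u)
        margin = sum(units[k] * (p[k] - cost[k]) for k in base_u)
        return sum(units.values()), revenue, margin

    baseline = lineup_metrics(price["Table Q"])
    for q_price in (39, 44, 49, 54, 59):
        indexed = [round(m / b, 2) for m, b in zip(lineup_metrics(q_price), baseline)]
        print(f"Table Q @ ${q_price}: units {indexed[0]}, revenue {indexed[1]}, margin {indexed[2]}")

Plotting the three indexed series against the candidate prices produces a chart with the general shape of Figure 4.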
This Figure 4 graphic, along with the precise estimates of units, revenue, and margin changes with various Table Q price adjustments, provided the account with various options to be considered, in light of their retail objectives to balance unit, revenue, and margin objectives.

The second example involves the prospective introduction of a new and innovative version of a furniture product (non-Lifetime) already sold by a given retailer. Three variants of the new product were tested: Product A at high and moderate price levels and Product B (an inferior version of Product A) at a relatively low price. Each of these product-price scenarios was compared with the base case for the current product continuing to be sold by itself. And, in contrast to the previous table example, only units (share) and revenue were measured in this experiment (retail margin data for the existing product were not available). (See Figure 5.)

Figure 5: Using Simulations to Inform the Introduction of a New Product (callouts note the retailer's increase in total unit volume with the introduction of the new product, and the "sweet spot" product/pricing options)

The first result to note (especially from the retailer's perspective) is the overall expansion of unit sales under all three new-product-introduction scenarios. This would make sense, since in each case there would be two options for consumers to consider, thus reducing the proportion of retailers' consumers who would not buy either option at this store (light gray bar portions at top). The second finding of note (especially from Lifetime's point of view) was that unit sales of the new concept (black bars at the bottom) would be maximized by introducing Product A at the moderate price. Of course, this would also result in the smallest net unit sales of the existing product (dark gray bars in the middle).

Finally, the matter of revenue is considered. As seen in the revenue index numbers at the bottom margin of the graphic (with the current case indexed at 1.00), overall retail revenue for this category would be maximized by introducing Product A at the high price (an increase of 26% over current retail revenue). However, it also should be noted that introducing Product A at the moderate price would also result in a sizable increase in revenue over the current case (+21%, only slightly lower than that of the high price). Thus Lifetime and the retailer were presented with some interesting options (see the "Sweet Spot" callout in Figure 5), depending on how unit and revenue objectives for both current and new products were enforced. And, of course, they also could consider introduction of Product B at its low price point, which would result in the greatest penetration among the retailer's customers, but at the expense of almost zero growth in overall revenue.
It is important to stay immersed in conjoint analysis principles and methodology through seminars, conferences (even if the technical/mathematical details are a bit of a comprehension "stretch"), and the literature. And, in the final analysis, there's nothing like doing a paper for one of those conferences to re-institute best practices! Ideally a paper such as this should include some element of "research on research" (experiment with methods, settings, etc.) to stretch one's capabilities even further.

II. 2013 STORAGE SHED ACBC CASE STUDY

It had been nearly five years since Lifetime's last U.S. storage shed conjoint study when the Company's Lawn & Garden management team requested an update to the earlier study. In seeking a new study, their objective was to better inform current strategic planning, tactical decision-making, and sales presentations for this important category. At the same time, this shed study "refresh" presented itself as an ideal vehicle as a case study for the current Sawtooth Software conference paper to illustrate the Company's recent progress in its conjoint analysis practices. The specific objectives for including this case study in this current paper are shown below.

- Demonstrate a typical conjoint analysis done by a private practitioner in an industrial setting.
- Validate the new conjoint model by comparing market simulations with in-sample holdout preference tasks (test-retest format).
- Include a "research on research" split-sample test on the effects of three different ACBC research design settings.

An overview of the 2013 U.S. Storage Shed ACBC study is included in the table below:

ACBC Instrument Example Screenshots

As many users and students of conjoint analysis know, Sawtooth Software's Adaptive Choice-Based Conjoint protocol begins with a Build-Your-Own (BYO) exercise to establish the respondent's preference positioning within the array of all possible configurations of the product in question. (Figure 12, to be introduced later, illustrates this positioning visually.) Figure 6 shows a screenshot of the BYO exercise for the current storage shed conjoint study.

Figure 6: Build-Your-Own (BYO) Shed Exercise

The design with 9 non-price attributes and 25 total levels results in a total of 7,776 possible product configurations. In addition, the range of summed prices from $199 to $1,474, amplified by a random price variation factor of ±30 percent (in the Screening phase of the protocol, to follow) provides a virtually infinite array of product-price possibilities. Note the use of conditional graphics (shed rendering at upper right) to help illustrate three key attributes that drive most shed purchase decisions (square footage, roof height, and materials of construction).

Following creation of the panelist's preferred (BYO) shed design, the survey protocol asks him/her to consider a series of "near-neighbor" concepts and to designate whether or not each one is a possibility for purchase consideration. (See Figure 7 and, later, Figure 12.) In essence, the subject is asked to build a consideration set of possible product configurations from which s/he will ultimately select a new favorite design. This screening exercise also captures any non-compensatory selection behaviors, as the respondent can designate some attribute levels as ones s/he must have—or wants to exclude—regardless of price.
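For readers new to ACBC, the sketch below illustrates the mechanics just described—generating a "near-neighbor" concept by varying a couple of attributes from the BYO selection, pricing it as a sum of level prices with a random ±30% factor, and applying a simple non-compensatory screen. The attribute levels, level prices, and screening rules are invented for illustration; they are not the study's design, and the actual ACBC software handles this generation internally.

    import random

    # Illustrative shed attributes and level prices (not the study's actual values).
    levels = {
        "construction": ["Sheet metal", "Resin", "Treated wood"],
        "square_feet": ["25 SF", "50 SF", "75 SF", "100 SF"],
        "roof_height": ["6 feet", "8 feet"],
        "shelving": ["Not included", "2 shelves"],
    }
    level_price = {"Sheet metal": 0, "Resin": 150, "Treated wood": 400,
                   "25 SF": 199, "50 SF": 400, "75 SF": 650, "100 SF": 900,
                   "6 feet": 0, "8 feet": 75, "Not included": 0, "2 shelves": 50}

    byo = {"construction": "Resin", "square_feet": "75 SF",
           "roof_height": "8 feet", "shelving": "2 shelves"}

    def near_neighbor(byo, n_changes=2):
        # Vary only a few attributes from the BYO concept (the "conservative" strategy).
        concept = dict(byo)
        for attr in random.sample(list(levels), n_changes):
            concept[attr] = random.choice([lvl for lvl in levels[attr] if lvl != byo[attr]])
        return concept

    def shown_price(concept):
        # Summed price of the chosen levels, with a random +/-30% variation factor.
        return sum(level_price[lvl] for lvl in concept.values()) * random.uniform(0.7, 1.3)

    # A non-compensatory screen a respondent might apply, regardless of price.
    must_have = {"square_feet": {"75 SF", "100 SF"}}
    unacceptable = {"construction": {"Sheet metal"}}

    def is_possibility(concept):
        ok_must = all(concept[a] in allowed for a, allowed in must_have.items())
        ok_excl = all(concept[a] not in banned for a, banned in unacceptable.items())
        return ok_must and ok_excl

    concept = near_neighbor(byo)
    print(concept, round(shown_price(concept)), "possibility" if is_possibility(concept) else "rejected")

In the actual protocol the respondent confirms such must-have and unacceptable rules during the Screening section rather than having them imposed up front as in this sketch.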
Figure 7: Shed Concept Screening Exercise

Note again the use of conditional graphics, which help guide the respondents by mimicking a side-by-side visual comparison common to many retail shed displays.

Following the screening exercise and the creation of an individualized short list of possible configurations, these concepts are arrayed in a multi-round "tournament" setting where the panelist ultimately designates the "best" product-price option. Conditional graphics again help facilitate these tournament choices. (See Figure 8.)

Figure 8: Shed Concept "Tournament" Exercise

The essence of the conjoint exercise is not to derive the "best" configuration, however. Rather, it is to discover empirically how the panelist makes the simulated purchase decision, including the:

- relative importance of the attributes in that decision,
- levels of each attribute that are preferred,
- interaction of these preferences across all attributes and levels, and
- implicit price sensitivity for these features, individually and collectively.

As we like to tell our clients trying to understand the workings and uses of conjoint analysis, "It's the journey—not the destination—that's most important with conjoint analysis."

ACBC Example Diagnostic Graphics

Notwithstanding the ultimate best use of conjoint analysis as a tool for market simulations, there are a few diagnostic reports and graphics that help clients understand what the program is doing for their respective study. First among these is the average attribute-importance distribution, in this case derived through Hierarchical Bayes estimation of the individual part-worths from the Multinomial Logit procedure. (See Figure 9.)

Figure 9: Relative Importance of Attributes from HB Estimation

It should be noted that these are only average importance scores, and that the simulations ultimately will take into account each individual respondent's preferences, especially if those preferences are far different from the average. Nevertheless, our clients (especially sales managers who are reporting these findings to retail chain buyers) can relate to interpretations such as "20 percent of a typical shed purchase decision involves—or is influenced by—the size of the shed."

Note in this graphic that price occupies well over one-third of the decision space for this array of shed products. This is due in large part to the wide range of prices ($199 minus 30 percent, up to $1,474 plus 30 percent) necessary to cover the range of sheds from a 25-square-foot sheet metal model with few add-on features up to a 100-square-foot wooden model with multiple add-ons. Within a defined sub-range of shed possibilities most consumers would consider (say, plastic sheds in the 50-to-75-square-foot range, with several feature add-ons), the relative importance of price would diminish markedly and the importance of other attributes would increase.

A companion set of diagnostics to the importance pie chart above involves line graphs showing the relative conjoint utility scores (usually zero-centered) showing the relative preferences for levels within each attribute. Again, recognizing that these are only averages, they provide a quick snapshot of the overall preference profile for attributes and levels. They also provide a good diagnostic to see if there are any reversals (e.g., ordinal-scale levels that do not follow a consistent progression of increasing or decreasing utility scores).
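Such a reversal check is easy to script once the average utilities are exported. The sketch below uses invented utility values in the spirit of Figure 10; only the ordinal attributes (where the levels have a natural order) are checked.

    # Quick reversal diagnostic: ordinal attributes should show monotonically
    # increasing average utilities in level order. Values below are illustrative.
    avg_utility = {
        "square_feet": {"25 SF": -62.0, "50 SF": -8.0, "75 SF": 27.0, "100 SF": 43.0},
        "roof_height": {"6 feet": -15.0, "8 feet": 15.0},
    }
    level_order = {
        "square_feet": ["25 SF", "50 SF", "75 SF", "100 SF"],
        "roof_height": ["6 feet", "8 feet"],
    }

    for attr, order in level_order.items():
        utils = [avg_utility[attr][lvl] for lvl in order]
        monotone = all(a <= b for a, b in zip(utils, utils[1:]))
        print(attr, "OK" if monotone else "REVERSAL", utils)

A reversal flagged here would prompt a closer look at the task design or at whether utility constraints are warranted.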
(See Figure 10.)

Figure 10: Average Conjoint Utility Scores from HB Estimation (Lifetime Shed Conjoint Study 2013; selected attributes—construction, square footage, roof height, wall style, flooring, and shelving; Survey Sampling Inc. nationwide sample, n=643)

The final diagnostic graphic we offer is the Price Utility Curve. (See Figure 11.) It is akin to the Average Level Utility Scores, just shown, except that (a) in contrast to most feature-based attributes, its curve has a negative slope, and (b) it can have multiple, independently sloped curve segments (eight in this case), using ACBC's Piecewise Pricing estimation option. Our clients can also relate to this as a surrogate representation for a demand curve, with varying slopes (price sensitivity).

Figure 11: Price Utility Curve Using Piecewise Method, with a Negative Price Constraint (Survey Sampling Inc. nationwide sample, n=571 net; callouts mark the relevant price ranges for sheet metal, plastic (25–100 SF), and wooden sheds, and a possible perceptual breakpoint at $999)

There are a few items of particular interest in this graphic. First, the differences in price ranges among the three shed types are called out. Although the differentiation between sheet metal and plastic sheds is fairly clear-cut, there is quite a bit of overlap between plastic and wooden sheds. Second, the price cut points have been set at $200 increments to represent key perceptual price barriers (especially $1,000, where there appears to be a possible perceptual barrier in the minds of consumers).

III. ACBC EXPERIMENTAL DESIGN AND TEST RESULTS

This section describes the experimental design of the 2013 Storage Shed ACBC study, and the "research-on-research" question we attempt to answer. It also discusses the holdout task specification and the measures used to determine the precision of the conjoint model in its various test situations. Finally, the results of this experimental test are presented.

Split-Sample Format to Test ACBC Designs

The Shed study used a split-sample format with three Adaptive Choice design variants based on incrementally relaxed definitions of the "near-neighbor" concept in the Screener section of the ACBC protocol. We have characterized those three design variants as Version 1—Conservative departure from the respondent's BYO concept, Version 2—Moderate departure, and Version 3—Aggressive departure.
(See Figure 12.)

Figure 12: Conservative to Aggressive ACBC Design Strategies—near-neighbors instead of a "full factorial." Within the total multivariate attribute space (9 attributes, nearly 7,800 unique product combinations, plus a virtually infinite number of prices), concepts are generated around the respondent's BYO ("ideal") shed configuration: Version 1/"Conservative" varies 2–3 attributes from the BYO concept per task (n=205), Version 2/"Moderate" varies 3–4 attributes (n=210), and Version 3/"Aggressive" varies 4–5 attributes (n=228). Adapted from Orme's 2008 ACBC Beta Test Instructional Materials.

Each qualified panelist was assigned randomly to one of the three questionnaire versions. As a matter of course, we verified that the demographic and product/purchase profiles of each of the three survey samples were similar (i.e., gender, age, home ownership, shed ownership, shed purchase likelihood, type of shed owned or likely to purchase, and preferred store for a shed purchase).

Going into this experiment, we had several expectations regarding the outcome. First, we recognized that Version 1—Conservative would be the least-efficient experimental design, because it defined "near neighbor" very closely, and therefore the conjoint choice tasks would include only product configurations very close to the BYO starting point (only 2 to 3 attributes were varied from the BYO-selected concept to generate additional product concepts). At the other end of the spectrum, Version 3—Aggressive would have the widest array of product configurations (varying from 4 to 5 of the attributes from the respondent's BYO-selected concept to generate additional concepts), resulting in a more efficient design. This is borne out by D-efficiency calculations provided by Bryan Orme of Sawtooth Software using the results of this study. As shown in Figure 13, the design of the Version 3 conjoint experiment was 27% more efficient than that of the Version 1 experiment.

Figure 13: Calculated D-efficiency of Design Versions

  Version                             D-efficiency   Index
  Version 1 (2-3 changes from BYO)        0.44         100
  Version 2 (3-4 changes from BYO)        0.52         118
  Version 3 (4-5 changes from BYO)        0.56         127

  (Calculations courtesy of Bryan Orme, using Sawtooth CVA and ACBC Version 8.2.)

Despite the statistical efficiency advantage of Version 3, we fully expected Version 1 to provide the most accurate results. In thinking about the flow of the interview, we felt Version 1 would be the most user-friendly for a respondent, since most of the product configurations shown in the Screening section would be very close to his/her BYO specification. The respondent would feel that the virtual interviewer ("Kylie" in this case) is paying attention to his/her preferences, and therefore would remain more engaged in the interview process and (presumably) be more consistent in answering the holdout choices. In contrast, the wider array of product configurations from the more aggressive Version 3 approach might be so far afield that the panelist would feel the interviewer is not paying as much attention to previous answers. As a result, s/he might become frustrated and uninvolved in the process, thereby registering less-reliable utility scores.

One of the by-products of this test was the expectation that those participating in the Version 1 questionnaire would see relatively more configurations they liked and would therefore bring forward more options ("possibilities") into the Tournament section.
Others answering the Version 3 questionnaire would see fewer options they liked and therefore would bring fewer options forward into the Tournament. As shown in Figure 14, this did indeed happen, with a slightly (but significantly) larger average number of conjoint concepts being brought forward in Version 1 than in Version 3.

Figure 14: Distribution of Shed Concepts Judged to be "a Possibility" (cumulative percent of respondents by number of concepts, out of a maximum of 28; Version 1 mean = 15.7, Version 2 mean = 15.0, Version 3 mean = 14.3, total mean = 15.0; differences in mean number of "possibilities" among the three ACBC versions are significant, P=.034; n=643)

Validation Using In-sample Holdout Questions

To validate the results of this survey experiment we used four in-sample holdout questions containing three product concepts each. The same set of holdouts was administered to all respondents in each of the three versions of the survey instrument. We also used a test-retest verification procedure, where the same set of four questions was repeated immediately, but with the order of presentation of concepts in each question scrambled. A summary of this Holdout Test-Retest procedure is included in the table below:

In order to maximize the reality of the holdouts, we generated the product configuration scenarios using real-life product options and prices (including extensive level overlap, in some cases). Of the 12 concepts shown, five were plastic sheds (three based on current Lifetime models and the other two based on typical competitor offerings), four were wooden sheds (competitors), and three were sheet metal sheds (competitors). In an effort to test realistic level overlap, the first holdout task (repeated in scrambled order in the fifth task) contained a Lifetime plastic shed and a smaller wooden shed, both with the same retail price. Likewise, in the second (and sixth) task Lifetime and one of its key plastic competitors were placed head-to-head. In keeping with marketplace realities, both of these models were very similar, with the Lifetime offering having a $100 price premium to account for three relatively minor product feature upgrades. (As will be seen shortly, this scenario made for a difficult decision for many respondents, and they were not always consistent during the test-retest reality check.)

To illustrate these holdout task-design considerations, two examples are shown below: Figure 15 (which compares Holdout Tasks #1 and #5) and Figure 16 (which compares Holdout Tasks #2 and #6).

Figure 15: In-Sample Holdout Tasks #1 & #5 (the callout in the original figure marks the Lifetime 8x10 Shed)

  Task #1 (First Phase)            Version 1   Version 2   Version 3   TOTAL
    Concept 1                        19.5%       16.2%       11.4%     15.6%
    Concept 2                        22.9%       20.0%       23.2%     22.1%
    Concept 3                        57.6%       63.8%       65.4%     62.4%

  Task #5 (Second Phase/Scrambled)
    Concept 1                        21.5%       20.0%       20.6%     20.7%
    Concept 2                        61.0%       64.8%       64.0%     63.3%
    Concept 3                        17.6%       15.2%       15.4%     16.0%

This set of holdout concepts nominally had the most accurate test-retest results of the four—and provided the best predictive ability for the conjoint model as well.
Note that the shares of preference for the Lifetime 8x10 Shed are within one percentage point of each other. Also, note that the Lifetime shed was heavily preferred over the comparably priced (but smaller and more sparsely featured) wooden shed.

Figure 16: In-Sample Holdout Tasks #2 & #6 (the callouts in the original figure mark the Lifetime 7x7 Shed and its close competitor)

  Task #2 (First Phase)            Version 1   Version 2   Version 3   TOTAL
    Concept 1                        19.5%       25.7%       27.6%     24.4%
    Concept 2                        38.0%       35.7%       32.9%     35.5%
    Concept 3                        42.4%       38.6%       39.5%     40.1%

  Task #6 (Second Phase/Scrambled)
    Concept 1                        36.1%       31.9%       32.5%     33.4%
    Concept 2                        30.2%       31.9%       31.1%     31.1%
    Concept 3                        33.7%       36.2%       36.4%     35.5%

This set of holdout concepts nominally had the least accurate test-retest results of the four and provided the worst predictive ability for the conjoint model. (Holdouts 3 & 7 and 4 & 8 had moderately reliable replication rates.) Note that the shares of preference for the Lifetime 7x7 Shed and those of the competitor 7x7 shed varied more between the Test and Retest phases than the task in Figure 15. It is also interesting to note that the competitor shed appeared to pick up substantial share of preference in the Retest phase when both products are placed side-by-side in the choice task. This suggests that, when the differences are well understood, consumers may not evaluate the $100 price premium for the more fully featured Lifetime shed very favorably.

Test Settings and Conditions

Here are the key settings and conditions of the Shed conjoint experiment and validation:

- Randomized First Choice simulation method:
  - Scale factor (Exponent within the simulator) adjusted within version to minimize errors of prediction (Version 1 = 0.55, Version 2 = 0.25, Version 3 = 0.35)
  - Root Mean Square Error (rather than Mean Absolute Error) used in order to penalize extreme errors
- Piecewise price function with eight segments, with the price function constrained to be negative
- Hit Rates used all eight holdout concepts (both Test and Retest phases together)
- Deleted 72 bad cases prior to final validation:
  - "Speeders" (less than four minutes) and "Sleepers" (more than one hour)
  - Poor Holdout Test-Retest consistency (fewer than two out of four tasks consistent)
  - Discrimination/straight-line concerns (panelist designated no concepts or all 28 concepts as "possibilities" in the Screening section)

Note that the cumulative impact of all these adjustments on Hit Rates was about +4 to +5 percentage points (i.e., moved from the low 60%'s to the mid 60%'s, as shown below).

The validation results for each survey version—and each holdout set—are provided in Figure 17. In each case, the Root Mean Square Error and Hit Rate are reported, along with the associated Test-Retest Rate (TRR).

Figure 17: Summary of Validation Results

                                          Root MSE   Hit Rate   Test-Retest Rate
  Across all 4 sets of holdouts:
    Version 1 / Conservative                 4.4        64%          82%
    Version 2 / Moderate                     6.5        65%          81%
    Version 3 / Aggressive                   6.5        69%          85%
  Across all 3 questionnaire versions:
    Holdouts 1 & 5                           5.2        75%          87%
    Holdouts 2 & 6                           4.9        54%          73%
    Holdouts 3 & 7                           3.8        68%          85%
    Holdouts 4 & 8                           8.6        66%          85%
  OVERALL                                    5.9        66%          83%

Here are some observations regarding the results in this table:

- Overall, the Relative Hit Rate (RHR) was about as expected (66% HR / 83% TRR = 80% RHR).
- The nominally increasing hit rate from Version 1 to Version 3 was not expected (we had expected it to decrease).
- There was a general lack of consistency between Root MSEs and Hit Rates, suggesting a lack of discernible impact of ACBC design (as tested) on precision of estimation.
- Holdouts #2 & #6 were the most difficult for respondents to deal with, with a substantially lower Hit Rate and Test-Retest Rate (but, interestingly, not the highest RMSE!).

In order to determine statistically whether adjustments in the ACBC design had a significant impact on ability to predict respondents' choices, we generated the following regression model, wherein we tested for significance in the two categorical (dummy) variables representing incremental departures from the near-neighbor ACBC base case. We also controlled for overall error effects, as measured by the holdout Test-Retest Rate. (See Figure 18.)

Figure 18: Shed ACBC Hit Rate Model

  Hit Rate = f(ACBC Version, Test-Retest Rate)

  Empirical regression model:
    HR {0-1} = .34 + .015 (V2 {0,1}) + .040 (V3 {0,1}) + .36 TRR {0-1}

  Note constants: V1 (Conservative) = .34; V2 (Moderate) = .355; V3 (Aggressive) = .38

  Model significance:
    Overall (F):                  P = .000   (Adjusted R² = 0.070)
    V2 (dummy) coefficient (T):   P = .568
    V3 (dummy) coefficient (T):   P = .114
    Test-Retest coefficient (T):  P = .000

Here are our observations on this regression model:

- Hit Rates increased only 1.5 and 4.0 percentage points from Version 1 (base case) to Versions 2 and 3, respectively. Neither one of these coefficients was significant at the .05 level.
- The error-controlling variable (Test-Retest Rate) was significant—with a positive coefficient—suggesting that hit rates go up as respondents pay closer attention to the quality and consistency of their responses.
- Of course, the overall model is significant, but only because of the Test-Retest controlling variable. Questionnaire version (i.e., aggressiveness of ACBC design) does NOT have a significant impact on Hit Rates.

KEY TAKEAWAYS

Validation procedures verify that the overall predictive value of the 2013 Storage Shed ACBC Study is reasonable. The overall Relative Hit Rate was .80 (or .66 / .83), implying that model predictions are 80% as good as the Test-Retest rate. Root MSEs were about 6.0—with corresponding MAEs in the 4–5 range—which is generally similar to other studies of this nature.

The evidence does not support the notion that Version 1 (the current, conservative ACBC design) provides the most valid results. Even controlling for test-retest error rates, there was no statistical difference in hit rates among the three ACBC design approaches. (In fact, if there were an indication of one design being more accurate than the other, one might argue that it could be in favor of Version 3, the most aggressive approach, with its nominally positive coefficient.)

While this apparent lack of differentiation in results among the three ACBC designs could be disappointing from a theoretical point of view, there are some positive implications:

For Lifetime Products: Since differences in predictive ability among the three test versions were not significant, we can combine data sets (n=571) for better statistical precision for client simulation applications.

For Sawtooth Software—and the research community in general: The conclusion that there are no differences in predictive ability, despite using a variety of conjoint design settings, could be a good "story to tell" about the robustness of ACBC procedures in different design settings.
This is especially encouraging, given the prospect of even more-improved design efficiencies in the upcoming Version 8.3 of ACBC.

Lifetime Products' experiences and learnings over the past few years suggest several key takeaways, particularly for new practitioners of conjoint analysis and other quantitative marketing tools:

Continue to explore and experiment with conjoint capabilities and design options.
Look for new applications for conjoint-driven market simulations.
Continuously improve your conjoint capabilities. Don't let it get too routine!
Treat your conjoint work with academic rigor.

Robert J. Goodwin

Special acknowledgement and thanks to: Bryan Orme (Sawtooth Software Inc.); Paul Johnson, Tim Smith & Gordon Bishop (Survey Sampling International); Chris Chapman (Google Inc.); Clint Morris & Vince Rhoton (Lifetime Products, Inc.)

CAN CONJOINT BE FUN?: IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS

JANE TANG
ANDREW GRENVILLE
VISION CRITICAL

SUMMARY

Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the number of respondents for Choice Based Conjoint (CBC) studies in the era of on-line panels. The results showed that respondents become less engaged in later tasks. Increasing the number of choice tasks brought limited improvement in the model's ability to predict respondents' behavior, and actually decreased model sensitivity and consistency. In 2012, we looked at how shortening CBC exercises impacts the individual-level precision of HB models, with a focus on the development of market segmentation. We found that using a slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes.

However, a CBC exercise with only 8 to 10 tasks is still considered boring by many respondents. In this paper, we looked at two ideas that may be useful in improving respondents' enjoyment level:

1. Augmenting the conjoint exercise using adaptive/tournament-based choices.
2. Sharing the results of the conjoint exercise.

Both of these interventions turn out to be effective, but in different ways. The adaptive/tournament tasks make the conjoint exercise less repetitive, and at the same time provide a better model fit and more sensitivity. Sharing results has no impact on the performance of the model, but respondents did find the study more "fun" and more enjoyable to complete. We encourage our fellow practitioners to review conjoint exercises from the respondent's point of view.
There are many simple things we can do to make the exercise appealing, and perhaps even add some "fun." While these new approaches may not yield better models, simply giving the respondent a more enjoyable experience, and by extension making him a happier panelist (one who is less likely to quit the panel), would be a goal worth aiming for.

1. INTRODUCTION

In the early days of CBC, respondents were often recruited into "labs" to complete questionnaires, either on paper or via a CAPI device. They were expected to take up to an hour to complete the experiment and were rewarded accordingly. The CBC tasks, while more difficult to complete than other questions (e.g., endless rating scale questions), were considered interesting by the respondents. Within the captive environment of the lab, respondents paid attention to the attributes listed and considered tradeoffs among the alternatives. Fatigue still crept in, but not until after 20 or 30 such tasks.

Johnson & Orme (1996) was the earliest paper the authors are aware of to address the suitable length of a CBC experiment. The authors determined that respondents could answer at least 20 choice tasks without degradation in data quality. Hoogerbrugge & van der Wagt (2006) was another paper to address this issue. It focused on holdout task choice prediction. They found that 10-15 tasks are generally sufficient for the majority of studies; the increase in hit rates beyond that number was minimal.

Today, most CBC studies are conducted online using panelists as respondents. CBC exercises are considered a chore. In the verbatim feedback from our panelists, we see repeated complaints about the length and repetitiveness of choice tasks. Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the number of respondents in the era of on-line panels. The results showed that respondents became less engaged in later tasks. Therefore, increasing the number of choice tasks brought limited improvement in the model's ability to predict respondents' behavior, and actually decreased model sensitivity and consistency. In 2012, we looked at how shortening CBC exercises affected the individual-level precision of HB models, with a focus on the development of market segmentation. We found that using a slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes.

However, a CBC exercise with only 8 to 10 tasks is still considered boring by many respondents. The GreenBook blog noted this, citing CBC tasks as number four in a list of the top ten things respondents hate about market research studies (http://www.greenbookblog.org/2013/01/28/10-things-i-hate-about-you-by-mr-r-e-spondent/).

2. WHY "FUN" MATTERS?

An enjoyable respondent survey experience matters in two ways:

Firstly, when respondents are engaged they give better answers that show more sensitivity, less noise and more consistency. In Suresh & Conklin (2010), the authors observed that, faced with the same CBC exercise, those respondents who received the more complex brand attribute section chose "none" more often and had more price order violations. In Tang & Grenville (2010), we observed that later choice tasks result in more "none" selections. When exposed to a long choice exercise, the respondents' choices contained more noise, resulting in less model sensitivity and less consistency (more order violations).
Secondly, today a respondent is often a panelist. A happier respondent is more likely to respond to future invites from that panel. From a panelist retention point of view, it is important to ensure a good survey experience.

We at Vision Critical are in a unique position to observe this dynamic. Vision Critical's Sparq software enables brands to build insight communities (a.k.a. brand panels). Our clients not only use our software, but often sign up for our service in recruiting and maintaining the panels. From a meta-analysis of 393 panel satisfaction surveys we conducted for our clients, we found that "Survey Quality" is the number two driver of panelist satisfaction, just behind "your input is valued" and ahead of the incentives offered.

Relative importance of panel service attributes:
The input you provide is valued: 16%
The quality of the studies you receive: 15%
The study topics: 15%
The incentives offered by the panel: 13%
The newsletters / communications that you receive: 12%
The look and feel of studies: 9%
The length of each study: 8%
The frequency of the studies: 8%
The amount of time given to respond to studies: 6%

There are many aspects to survey quality, not the least of which is producing a coherent and logical survey instrument/questionnaire and having it properly programmed on a webpage. A "fun" and enjoyable survey experience also helps to convey the impression of quality.

3. OUR IDEAS

There are many ways a researcher can create a "fun" and enjoyable survey. Engaging question types that make use of rich media tools can improve the look and feel of the webpage on which a question is presented, and make it easier for the respondent to answer those questions. Examples of that can be found in Reid et al. (2007). Aside from improving the look, feel and functionality of the webpages, we can also change how we structure the questions we ask to make the experience more enjoyable. Puleson & Sleep's (2011) award-winning ESOMAR congress paper gives us two ideas.

The first is introducing a game-playing element into our questioning. In the context of conjoint experiments, we consider how adaptive choice tasks could be used to achieve this. We can structure conjoint tasks to resemble a typical game, so the tasks become harder as one progresses through the levels. Orme (2006) showed how this could be accomplished in an adaptive MaxDiff experiment. In a MaxDiff experiment, a respondent is shown a small set of options, each with a short description, and is asked to choose the option he prefers most as well as the option he prefers least. In a traditional MaxDiff, this task is followed by many more sets of options, with all the sets having the same number of options. In an Adaptive MaxDiff experiment, this series of questioning is done in stages. The respondents see traditional MaxDiff tasks in the first stage, but the options chosen as preferred "least" in stage 1 are dropped in stage 2, the options chosen as preferred "least" in stage 2 are dropped in stage 3, and so on. The number of options used in the comparisons in each stage gets progressively smaller, so there are changes in the pace of the questions. Respondents can also see how their choices result in progressively more difficult comparisons. At the end, only the favorites are left to be pitted against each other. Orme (2006) showed that respondents thought this experience was more enjoyable. This type of adaptive approach is also at work in Sawtooth Software's Adaptive CBC (ACBC) product.
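To make the Adaptive MaxDiff staging just described concrete, here is a minimal sketch of the stage-dropping logic. The item lists, set sizes, and the simulated "respondent" are hypothetical; this is an illustration of the idea, not Sawtooth Software's implementation.

```python
# Sketch of Adaptive MaxDiff staging: items picked as "least preferred"
# in one stage are dropped before the next stage, so later comparisons
# involve only the better-liked items. The respondent is simulated with
# random picks purely so the sketch runs end to end.
import random

def simulated_respondent(shown):
    """Stand-in for a real MaxDiff task; returns (most, least) picks."""
    most, least = random.sample(shown, 2)
    return most, least

def adaptive_maxdiff(items, set_size=4, stages=3, answer=simulated_respondent):
    pool = list(items)
    for _ in range(stages):
        random.shuffle(pool)
        survivors = []
        for i in range(0, len(pool), set_size):
            shown = pool[i:i + set_size]
            if len(shown) < 2:              # nothing left to trade off in this set
                survivors.extend(shown)
                continue
            _most, least = answer(shown)
            survivors.extend(opt for opt in shown if opt != least)
        pool = survivors                    # the next stage works with a smaller pool
    return pool                             # the surviving "favorites"

favorites = adaptive_maxdiff([f"option {i}" for i in range(1, 13)])
```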
The third step in an ACBC experiment is a choice tournament based on all the product configurations in a respondent's consideration set. Tournament Augmented Conjoint (TAC) has been tried before by Chrzan & Yardley (2009). In their paper, the authors added a series of tournament tasks to the existing CBC tasks. However, as the CBC section was already quite lengthy, with accurate HB estimates, the authors concluded that the additional TAC tasks provided only modest and non-significant improvements, which did not justify the extra time it took to complete the questionnaire. We hypothesize that if we have a very short CBC exercise and make the tournament tasks quite easy (i.e., pairs), the tournament tasks may bring more benefits, or at least be more enjoyable for the panelists.

Our second idea comes from Puleson & Sleep (2011), who offered respondents a "two-way conversation." From Vision Critical's panel satisfaction research, we know that people join panels to provide their input. While respondents feel good about the feedback they provide, they want to know that they have been heard. Sharing the results of studies they have completed is a tangible demonstration that their input is valued. Most panel operators already do this, providing feedback on the survey results via newsletters and other engagement tools. However, we can go further. News media websites often have quick polls where they pose a simple question to anyone visiting the website. As soon as a visitor provides her answer, she can see the results from all the respondents thus far. That is an idea we want to borrow.

Dahan (2012) showed an example of personalized learning from a conjoint experiment. A medical patient completed a series of conjoint tasks. Once he finished, he received the results outlining his most important outcome criterion. This helped the patient to communicate his needs and concerns to his doctors. It could also help him make future treatment decisions. Something like this could be useful for us as well.

4. FIELD EXPERIMENT

We chose a topic that is of lasting interest to the general population: dating. We formulated a conjoint experiment to determine what women were looking for in a man.

Cells

The experiment was fielded in May 2013 in Canada, the US, the UK and Australia. We had a sample size of n=600 women in each country. In each country, respondents were randomly assigned into one of four experimental cells:

CBC (8 tasks, triples): n=609
CBC (8 tasks, triples) + Shareback: n=623
CBC (5 tasks, triples) + Tournament (4 tasks, pairs): n=613
CBC (5 tasks, triples) + Tournament (4 tasks, pairs) + Shareback: n=618

While the CBC-only cells received 8 choice tasks, all of them were triples. The Tournament cells had 9 tasks: 5 triples and 4 pairs. The amount of information, based on the number of alternatives seen by each respondent, was approximately the same in all the cells. We informed the respondents in the Shareback cells at the start of the interview that they would receive the results from the conjoint experiment after it was completed.

Questionnaire

The questionnaire was structured as follows:
1. Intro/Interest in topic
2. All about you: Demos/Personality/Preferred activity on dates
3. What do you look for in a man?
o Personality
o BYO: your ideal "man"
4. Conjoint exercise per cell assignment
5. Share back per cell assignment
6. Evaluation of the study experience
A Build-Your-Own (BYO) task, in which we asked the respondents to tell us about their ideal "man," was used to educate the respondents on the factors and levels used in the experiment. Tang & Grenville (2009) showed that a BYO task was effective in preparing respondents for making choice decisions.

Vision Critical's standard study experience module was used to collect the respondents' evaluation data. This consisted of 4 attribute ratings measured on a 5-point agreement scale, and any volunteered open-ended verbatim comments on the study topic and survey experience. The four attribute ratings were:

Overall, this survey was easy to complete
I enjoyed filling out this survey
I would fill out a survey like this again
The time it took to complete the survey was reasonable

Factors & Levels

The following factors were included in our experiment. Note that body type images were used in the BYO task only, not in the conjoint tasks.

Age: Much older than me; A bit older than me; About the same age; A bit younger than me; Much younger than me
Height: Much taller than me; A little taller than me; Same height as me; Shorter than me
Body Type: Big & Cuddly; Big & Muscly; Athletic & Sporty; Lean & Fit (images used at the BYO question only)
Career: Driven to succeed and make money; Works hard, but with a good work/life balance; Has a job, but it's only to pay the bills; Prefers to find work when he needs it
Activity: Exercise fanatic; Active, but doesn't overdo it; Prefers day to day life over exercise
Attitude towards Family/Kids: Happy as a couple; Wants a few kids; Wants a large family
Personality: Reliable & Practical; Funny & Playful; Sensitive & Empathetic; Serious & Determined; Passionate & Spontaneous
Flower Scale: Flowers, even when you are not expecting; Flowers for the important occasions; Flowers only when he's saying sorry; "What are flowers?"
Yearly Income (Australia / US-Canada / UK): Pretty low (Under $50,000 / Under $30,000 / Under £15,000); Low middle ($50,000-$79,999 / $30,000-$49,999 / £15,000-£39,999); Middle ($80,000-$119,999 / $50,000-$99,999 / £40,000-£59,999); High middle ($120,000-$159,999 / $100,000-$149,999 / £60,000-£99,999); Really high ($160,000 or more / $150,000 or more / £100,000 or more)

Screen Shots

A CBC task was presented to the respondent as follows: [screenshot]

The adaptive/tournament tasks were formulated as follows. Randomly order the 5 winners from the CBC tasks and label them Item 1 to Item 5. Then:
Set 1: Item 1 vs. Item 2
Set 2: Item 3 vs. Item 4 (drop the loser from Set 1)
Set 3: Item 5 vs. the winner from Set 1 (drop the loser from Set 2)
Set 4: Winner from Set 2 vs. winner from Set 3 (drop the loser from Set 3)

The tournament task was shown as: [screenshot]
The personalized learning page was shown as follows: [screenshot]

Personalized learning was based on frequency count only. For each factor, we counted how often each level was presented to that respondent and how often it was chosen when presented. The most frequently chosen level was presented back to the respondents. These results were presented mostly for fun; the counting analysis was not the best for providing this kind of individual feedback. The actual profiles presented to each individual respondent in her CBC tasks were not perfectly balanced, and the Tournament cells, where the winners were presented to each respondent, would also have added bias for the counting analysis. If we wanted to focus on getting accurate individual results, something like an individual-level logit model would be preferred.
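For concreteness, the counting logic amounts to something like the sketch below. The data layout and field names are hypothetical, and taking the highest chosen-to-shown ratio is one reasonable reading of "most frequently chosen when presented"; this is not the production code used in the survey engine.

```python
# Sketch of the frequency-count feedback: for each factor, find the level
# the respondent chose most often relative to how often it was shown to her.
# `tasks` is a hypothetical list of (shown_concepts, chosen_index) pairs,
# where each concept is a dict mapping factor -> level.
from collections import defaultdict

def personalized_feedback(tasks):
    shown = defaultdict(lambda: defaultdict(int))
    chosen = defaultdict(lambda: defaultdict(int))
    for concepts, chosen_index in tasks:
        for i, concept in enumerate(concepts):
            for factor, level in concept.items():
                shown[factor][level] += 1
                if i == chosen_index:
                    chosen[factor][level] += 1
    feedback = {}
    for factor, levels in shown.items():
        # Pick the level with the highest chosen/shown ratio for this respondent.
        feedback[factor] = max(levels, key=lambda lvl: chosen[factor][lvl] / levels[lvl])
    return feedback
```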
However, here we felt the simple counting method would be sufficient, and it was easy for our programmers to implement. Each respondent in the Shareback cells was also shown the aggregate results from her fellow countrywomen who had completed the survey thus far in her experimental cell.

5. RESULTS

We built HB models for each of the 4 experimental cells separately. Part-worth utilities were estimated for all the factors. Sawtooth Software's CBC/HB product was used for the estimation.

Model Fit/Hit Rates

We deliberately did not design holdout tasks for this study. We wanted to measure the results of making a study engaging, and using holdout tasks makes the study take longer to complete, which tends to have the opposite effect. Instead of purposefully designed holdout tasks, we randomly held out one of the CBC tasks to measure model fit. Since respondents tend to spend much more time on their first choice task, we decided to exclude the 1st task for this purpose. For each respondent, one of her 2nd, 3rd, 4th and 5th CBC tasks was randomly selected as the holdout task.

The hit rates for the Tournament cells (63%) were much higher than for the CBC cells (54%). That result was surprising at first, since we would expect no significant improvement in model performance for the Tournament cells. However, while the randomly selected holdout task was held out from the model, the winner from that task was still included in the tournament tasks, which may explain the increased performance. In order to avoid any influence of the random holdout task, we reran the models for the Tournament cells, holding out information from the holdout task itself and any tournament tasks related to its winner. The new hit rates (52%) are comparable to those of the CBC cells.

However, by holding out not only the selected random holdout task, but also at least one and potentially as many as three out of the four tournament tasks, we might have gone too far in withholding information from the modeling. Had full information been used in the modeling, we expect the Tournament cells would have a better model fit and be better able to predict respondents' choice behavior. Respondents seem to agree with this. Those who participated in the tournament thought we did a better job of presenting their personalized learning information to them. While this information is based on a crude counting analysis and has potential bias issues, it is still comforting to see this result.

The improvement in model fit is also reflected in a higher scale parameter in the model, with the Tournament cells showing stronger preferences, i.e., less noise and higher sensitivity. The graph below shows the simulated preference shares for a "man" for each factor, one level at a time (holding all other factors at neutral). The shares are rescaled so that the levels within each factor average to 0.

"Fun" & Enjoyment

Respondents had a lot of fun during this study. The top-box ratings for all 4 items track much higher than the ratings for the congressional politics CBC study used in our 2010 experiment. Disappointingly, there are not any differences across the 4 experimental cells in these ratings. We suspect this is due to the high interest in the topic of dating and the fact that we went out of our way to make the experience a good one for all the cells. Had we tested these interventions in a less interesting setting (e.g., smartphones), we think we would have seen larger effects.
Interestingly, we saw significant differences in the volunteered open-ended verbatim answers from respondents. Many of these verbatim answers are about how they enjoyed the study experience and had fun completing the survey. Respondents in the Shareback cells volunteered more comments, and more "fun"/enjoyment comments, than those in the non-Shareback cells. While an increase from 6.7% to 9.0% appears to be only a small improvement, given that only 13% of the respondents volunteered any comments at all across the 4 cells, this reflects a sizeable change.

6. CONCLUSIONS & RECOMMENDATION

Both of these interventions are effective, but in different ways. The adaptive/Tournament tasks make the conjoint exercise less repetitive and less tedious, and at the same time provide better model fit and more sensitivity in the results. While sharing results has no impact on the performance of the model, the respondents find the study more fun and more enjoyable to complete.

Should we worry about introducing bias with these methods? The answer is no. Adaptive methods have been shown to give results consistent with traditional approaches in many different settings, both for Adaptive MaxDiff (Orme 2006) and in numerous papers related to ACBC. Aside from the scale difference, our results from the Tournament cells are also consistent with those from the traditional CBC cells. Advising respondents that we would share back the results of the findings also had no impact on their choice behaviors.

We encourage fellow practitioners to review conjoint exercises from the respondent's point of view. There are many simple things we can do to make the exercise appealing, and perhaps even add "fun." While these new approaches may not yield better models, simply giving the respondent a more enjoyable experience, and by extension making him a happier panelist, would be a goal worth aiming for. In the words of a famous philosopher: while conjoint experiments may not be enjoyable by nature, there is no reason respondents cannot have a bit of fun in the process.

Jane Tang

REFERENCES

Chrzan, K. & Yardley, D. (2009), "Tournament-Augmented Choice-Based Conjoint," Sawtooth Software Conference Proceedings.
Dahan, E. (2012), "Adaptive Best-Worst Conjoint (ABC) Analysis," Sawtooth Software Conference Proceedings.
Hoogerbrugge, M. & van der Wagt, K. (2006), "How Many Choice Tasks Should We Ask?" Sawtooth Software Conference Proceedings.
Johnson, R. & Orme, B. (1996), "How Many Questions Should You Ask In Choice-Based Conjoint Studies?" ART Forum Proceedings.
Orme, B. (2006), "Adaptive Maximum Difference Scaling," Sawtooth Software Technical Paper Library.
Puleson, J. & Sleep, D. (2011), "The Game Experiments: Researching how gaming techniques can be used to improve the quality of feedback from on-line research," ESOMAR Congress 2011 Proceedings.
Reid, J., Morden, M. & Reid, A. (2007), "Maximizing Respondent Engagement: The Use of Rich Media," ESOMAR Congress. Full paper can be downloaded from http://vcu.visioncritical.com/wp-content/uploads/2012/02/2007_ESOMAR_MaximizingRespondentEngagement_ORIGINAL-1.pdf
Suresh, N. & Conklin, M. (2010), "Quantifying the Impact of Survey Design Parameters on Respondent Engagement and Data Quality," CASRO Panel Conference.
Tang, J. & Grenville, A. (2009), "Influencing Feature Price Tradeoff Decisions in CBC Experiments," Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2010), "How Many Questions Should You Ask in CBC Studies?—Revisited Again," Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2012), "How Low Can You Go?: Toward a better understanding of the number of choice tasks required for reliable input to market segmentation," Sawtooth Software Conference Proceedings.

MAKING CONJOINT MOBILE: ADAPTING CONJOINT TO THE MOBILE PHENOMENON

CHRIS DIENER (1)
RAJAT NARANG (2)
MOHIT SHANT (3)
HEM CHANDER (4)
MUKUL GOYAL (5)
ABSOLUTDATA

(1) Senior Vice President, AbsolutData Intelligent Analytics [Email: [email protected]]
(2) Senior Expert, AbsolutData Intelligent Analytics [Email: [email protected]]
(3) Team Lead, AbsolutData Intelligent Analytics [Email: [email protected]]
(4) Senior Analyst, AbsolutData Intelligent Analytics [Email: [email protected]]
(5) Senior Programmer, AbsolutData Intelligent Analytics [Email: [email protected]]

INTRODUCTION: THE SMART AGE

With "smart" devices like smartphones and tablets integrating the "smartness" of personal computers, mobiles and other viewing media, a monumental shift has been observed in the usage of smart devices for information access. Sales of smart devices were estimated to cross the billion mark in 2013. The widespread usage of these devices has impacted the research world too. A study found that 64% of survey respondents preferred smartphone surveys, 79% of them preferring to do so due to the "on-the-go" nature of it (Research Now, 2012). Multiple research companies have already started administering surveys for mobile devices, predominantly designing quick-hit mobile surveys to understand the reactions and feedback of consumers on-the-go.

Prior research ("Mobile research risk: What happens to data quality when respondents use a mobile device for a survey designed for a PC," Burke Inc, 2013) has suggested that, when comparing the results of surveys adapted for mobile devices to those on personal computers, respondent experience is poorer while data quality is comparable between mobile and personal computers. This prior research also discourages the use of complex research techniques like conjoint on the mobile platform. This comes as no surprise, as conjoint has long been viewed as a complex and slightly monotonous exercise from the respondent's perspective. The mobile platform's small viewing interface and internet speed can act as potential barriers for using conjoint.

ADAPTING CONJOINT TO THE MOBILE PLATFORM

Recognizing the need to reach respondents who are using mobile devices, research companies have introduced three different ways of conducting surveys on mobile platforms: web browser based, app based and SMS based. Of these three, the web browser is the most widely used, primarily due to the limited customization required to host the surveys simultaneously on mobile platforms and personal computers. The primary focus of mobile-platform-based surveys is short and simple surveys like customer satisfaction, initial product reaction, and attitude and usage studies.
However, the research industry is currently hesitant to conduct conjoint studies on the mobile platform due to concerns with:

Complexity: conjoint is known to be a complex and intimidating exercise due to the number of tasks and the level of detail shown in the concepts
Inadequate representation on the small screen: having a large number of long concepts on the screen can affect readability
Short attention span of mobile users
Feasibility of a conjoint study with a large number of attributes and tasks: if a large number of attributes is used, the entire concept may not be shown on a single screen, requiring a user to scroll
Penetration of smartphones in a region

In this paper, we hypothesize that all of these (with the exception of smartphone penetration) can be countered by focusing on improving the aesthetics and simplifying the conjoint tasks, as illustrated in Figure 1. Our changes include:

Improving aesthetics
o Coding the task outlay to optimally use the entire screen space
o Minimum scrolling to view the tasks
o Reduction of the number of concepts being shown on a screen

Simplifying conjoint tasks
o Reduction in the number of tasks
o Reduction of the number of attributes on a screen
o Using simplified conjoint methodologies

Figure 1

We improve aesthetics using programming techniques. To simplify conjoint tasks and make them more readable to the respondents, we customize several currently used techniques to adapt them to the mobile platform and compare their performance.

CUSTOMIZING CURRENT TECHNIQUES

Shortened ACBC

Similar to ACBC, this method uses pre-screening to identify the most important attributes for each respondent. The attributes selected in the pre-screening stage qualify through to the Build Your Own (BYO) section and the Near Neighbor section. We omitted the choice tournament in order to reduce the number of tasks being evaluated. ACBC is known to present simpler tasks and better respondent engagement by focusing on high-priority attributes. We further simplified the ACBC tasks by truncating the list of attributes and hence reducing the length of the concepts on the screen. Also, the number of concepts per screen was reduced to 2 to simplify the tasks for respondents. An example is shown in Figure 2.

Figure 2 Shortened ACBC Screenshot

Pairwise Comparison Rating (PCR)

We customize an approach similar to the Conjoint Value Analysis (CVA) method. Similar to CVA, this method shows two concepts on the screen. We show respondents a 9-point scale and ask them to indicate preference: the 4 points on the left/right indicate preference for the product on the left/right, respectively, while the rating point in the middle implies that neither of the products is preferred. For the purpose of estimation, we convert the rating responses to:

Discrete Choice: If respondents mark either half of the scale, the data file (formatted as CHO for utility estimation with Sawtooth Software) reports that concept as being selected. If they mark the middle rating point, it implies that they have chosen the None option.

Chip Allocation: We convert the rating given by the respondents to volumetric shares in the CHO file. In the case of a partial liking of the concept (wherein the respondent marked 2/3/4 or 6/7/8), we allocate the rest of the share to the "none" option. So, for example, a rating of 3 would indicate "Somewhat prefer left," so 50 points would go to the left concept and 50 to none. Similarly, a rating of 4 would indicate 25:75 in favor of none.
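To make the two conversions concrete, here is a minimal sketch of the rating-to-response mapping just described. The function names are hypothetical, and the formatting of the actual CHO file for Sawtooth Software's estimation tools is not shown.

```python
# Sketch of converting a 9-point paired-comparison rating (1-9, 5 = neither)
# into (a) a discrete choice and (b) a 100-point chip allocation across
# the left concept, the right concept, and "none".

def rating_to_discrete(rating):
    """Return which alternative is treated as chosen: 'left', 'right', or 'none'."""
    if rating < 5:
        return "left"
    if rating > 5:
        return "right"
    return "none"

def rating_to_chips(rating):
    """Allocate 100 points among (left, right, none).
    E.g., 1 -> (100, 0, 0), 3 -> (50, 0, 50), 4 -> (25, 0, 75), 5 -> (0, 0, 100)."""
    if rating == 5:
        return (0, 0, 100)
    if rating < 5:
        left = 25 * (5 - rating)          # four 25-point steps toward the left concept
        return (left, 0, 100 - left)
    right = 25 * (rating - 5)             # four 25-point steps toward the right concept
    return (0, right, 100 - right)
```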
We include chip allocation to understand the impact of a reduced-complexity approach on the results. Also, when estimating results using chip allocation, the extent of likeability of the product can be taken into account (as opposed to the single select of traditional CBC). We use two concepts per screen to simplify the tasks for respondents, as shown in Figure 3.

Figure 3 Pairwise Comparison Rating Screenshot

CBC (3 concepts per screen)

The 3-concept-per-screen CBC we employ is identical to CBC conducted on personal computers (Figure 4). We do this to compare the data quality across platforms for an identical method.

Figure 4 CBC Mobile (3 concepts) Screenshot

CBC (2 concepts per screen)

Similar to traditional CBC, we also include a CBC with only 2 concepts shown per screen. This allows us to understand the result of direct CBC simplification, an example of which is shown in Figure 5.

Figure 5 CBC Mobile (2 concepts) Screenshot

Partial Profile

This method is similar to traditional Partial Profile CBC, where we fix the primary attributes on the screen and then rotate a set of secondary attributes. We further simplify the tasks by reducing the length of the concept (showing a truncated list of attributes) and also by reducing the number of concepts shown per screen (2 concepts per screen), as shown in Figure 6.

Figure 6 Partial Profile Screenshot

RESEARCH DETAILS

The body of data collected to test our hypotheses and address our objectives is taken from quantitative surveys we conducted in the US and India. In each country, we surveyed 1,200 respondents (thanks to uSamp and IndiaSpeaks for providing sample in the US and India, respectively). Each of our tested techniques was evaluated by 200 distinct respondents. Results were also gathered for traditional CBC (3 concepts per screen) administered on personal computers to serve as a baseline of data quality. The topic of the surveys was the evaluation of various brands of tablet devices (a total of 9 attributes were evaluated). In addition to the conjoint tasks, the surveys also explored respondent reaction to the survey, as well as some basic demographics. The end-of-survey questions and evaluation criteria included:

The respondent's experience in taking the survey on mobile platforms
Validity of results from the researcher's perspective
The merits and demerits of each technique
Whether the efficacy of the techniques differs according to the online-survey maturity of the region (fewer online surveys are conducted in India than in the US)

RESULTS

After removing the speeders and straightliners, we evaluate the effectiveness of the different techniques from two perspectives: the researcher perspective (technical) and the respondent perspective (experiential). We compare the results for both countries side-by-side to highlight key differences.

RESEARCHER'S PERSPECTIVE

Correlation analysis

Pearson correlations compare the utilities of each of the mobile methods with the utilities of CBC on personal computers. The results of the correlations are found in Table 1. Most of the methods, with the exception of PCR (estimated using chip allocation), show very high correlation and thus appear to mimic the results obtained on personal computers. This supports the notion that we are getting similar and valid results between personal computer and mobile platform surveys.

Table 1. Correlation analysis with utilities of CBC PC
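As a rough illustration of this comparison, the sketch below computes a Pearson correlation between two methods' average part-worth utilities. It assumes the HB utilities have already been averaged across respondents and aligned level-by-level; the arrays shown are purely illustrative values, not results from the study.

```python
# Sketch: Pearson correlation between two methods' mean part-worth utilities.
# Element i must refer to the same attribute level in both arrays.
import numpy as np

utils_pc = np.array([0.42, -0.10, -0.32, 0.25, 0.05, -0.30])      # illustrative values
utils_mobile = np.array([0.45, -0.12, -0.33, 0.20, 0.08, -0.28])  # illustrative values

r = np.corrcoef(utils_pc, utils_mobile)[0, 1]
print(f"Pearson correlation between methods: {r:.3f}")
```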
Holdout accuracy

We placed fixed choice tasks in the middle and at the end of the exercise for each method. Due to the varying nature of the methods (varying number of concepts and attributes shown per task), the fixed tasks were not uniform across methods; each fixed task was altered to suit the technique in question. For example, partial profile had only 6 attributes for 2 concepts, versus a full-profile CBC which had 9 attributes for 3 concepts, and the fixed tasks were designed accordingly.

As displayed in Table 2, holdout task prediction rates are strong and in the typical expected range. All the methods customized for mobile platforms do better than CBC on personal computers (with the exception of PCR estimated via chip allocation). CBC with 3 concepts, whether on mobile or on PC, did equally well.

Table 2. Hit Rate Analysis (arrows indicate statistically significant difference from CBC PC)

When we adjust these hit rates for the number of concepts presented, discounting the hit rates for tasks with fewer concepts, the relative accuracy of the 2- versus the 3-concept tasks shifts. (We divided the hit rates by the probability of selecting each concept on a screen. For example, for a task with two concepts and a none option, the random probability of selection is 33.33%; all hit rates obtained for fixed tasks with two concepts were therefore divided by 33.33% to get the index score.) The adjusted hit rates are shown in Table 3. With adjusted hit rates, the 3-concept task gains the advantage over 2 concepts. We interpret these unadjusted and adjusted hit rates together to indicate that, by and large, the 2- and 3-concept tasks generate similar hit rates. Also, in the larger context of comparisons, except for PCR, all of the other techniques are very comparable to CBC PC and do well.

Table 3. Adjusted Hit Rate Analysis

MAE

MAE scores displayed in Table 4 tell a similar story, with simplified methods like CBC (2 concepts) and Partial Profile doing better than CBC on personal computers.

Table 4. MAE Analysis

RESPONDENT'S PERSPECTIVE

Average Time

Largely, respondents took more time to evaluate conjoint techniques on the mobile platforms. As displayed in Chart 1, shortened ACBC takes more time for respondents to evaluate, particularly for the respondents from India. This is expected due to the rigorous nature of the method. PCR also took a lot of time to evaluate, especially for the respondents from India. This might indicate that a certain level of maturity is required from respondents for the evaluation of complex conjoint techniques, reflecting the fact that online surveys are still at a nascent stage in India. Respondents took the least amount of time to evaluate CBC Mobile (2 concepts), indicating that respondents can comprehend simpler tasks quicker.

Chart 1. Average time taken (in minutes)

Readability

Respondents largely find the tasks legible on mobile. This might be attributed to the reduced list of attributes being shown on the screen. Surprisingly, as seen in Chart 2, CBC Mobile (3 concepts) also does well on this front, which means that optimizing screen space on mobiles can go a long way in providing readability.

Chart 2. Readability of methods on PC/Mobile screens (arrows indicate statistically significant difference from CBC PC)

Ease of understanding

Respondents found the concepts presented on the mobile platform easy to understand, and the degree of understanding is comparable to conjoint on personal computers. Thus, conjoint research can readily be conducted on mobile platforms too.
Chart 3. Ease of understanding of methods on PC/Mobile screens (arrows indicate statistically significant difference from CBC PC)

Enjoyability

US respondents found the survey to be significantly less enjoyable than their Indian counterparts, as displayed in Chart 4. This might be because the online survey market in the US is quite saturated compared to the Indian market, which is still nascent. Therefore, respondent exposure to online surveys might be significantly higher in the US, contributing to the lower enjoyability.

Chart 4. Enjoyability of methods on PC/Mobile screens (arrows indicate statistically significant difference from CBC PC)

Encouragement to give honest opinion

Respondents find that all the methods encouraged honest opinions in the survey.

Chart 5. Encouragement to give honest opinions (arrows indicate statistically significant difference from CBC PC)

Realism of tablet configuration

Respondents believe that the tablet configurations are realistic. As seen in Chart 6, all the methods are more or less at par with CBC on personal computers. This gives us confidence in the results because the same tablet configurations were used in all techniques.

Chart 6. Realism of tablet configuration (arrows indicate statistically significant difference from CBC PC)

SUMMARY OF RESULTS

On the whole, all of the methods we customized for the mobile platform did very well in providing good respondent engagement and robust data quality. Although conjoint exercises with 3 concepts perform well on data accuracy parameters, they do not fare as well in terms of respondent experience. However, the negative effect on respondent experience can be mitigated by optimal use of screen space and appealing aesthetics.

Our findings indicate that a conjoint exercise with 2 concepts is the best of the alternative methods we tested in terms of enriching data quality as well as user experience. CBC with 2 concepts performs exceptionally well, providing richer data quality and better respondent engagement than CBC on personal computers. The time taken to complete the exercise is also at par with that of CBC on PC. PCR (discrete estimation) does fairly well too. However, its practical application might be debated, with other methods being equally robust, if not more so, and easier to implement.

One may also consider lowering the number of attributes shown on the screen, in conjunction with reducing the number of concepts, by using partial profile or shortened ACBC exercises. Although these methods score high on data accuracy parameters, respondents find them slightly harder to understand because the full profile of the products on offer is not shown. However, once respondents cross the barrier of understanding, these methods prove highly enjoyable and encourage them to give honest responses. They also take longer to evaluate. Therefore, these should be used in studies where the conjoint exercise is the sole component of the survey design.

CONCLUSION

This paper shows that researchers can confidently conduct conjoint in mobile surveys. Respondents enjoy taking conjoint surveys on their mobile, probably due to its "on-the-go" nature. Researchers might want to adopt simple techniques like screen space optimization and simplification of tasks in order to conduct conjoint exercises on mobile platforms. This research also indicates that the data obtained from conjoint on mobile platforms is robust and mirrors data from personal computers to a considerable extent (shown by high correlation numbers).
This research supports the idea that researchers can probably safely group responses from mobile platforms and personal computers and analyze them without the risk of error.

Chris Diener

CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS

JOSEPH WHITE
MARITZ RESEARCH

BACKGROUND

Recent years have witnessed the rapid adoption of increasingly mobile computing devices that can be used to access the internet, such as smartphones and tablets. Along with this increased adoption we see an increasing proportion of respondents complete our web-based surveys in mobile environments. Previous research we have conducted suggests that these mobile responders behave similarly to PC and tablet responders. However, these tests have been limited to traditional surveys primarily using rating scales and open-ended responses. Discrete choice experiments may present a limitation for this increasingly mobile respondent base, as the added complexity and visual requirements of such studies may make them infeasible or unreliable for completion on a smartphone.

The current paper explores this question through two large case studies involving more complicated choice experiments. Both of our case studies include design spaces based on 8 attributes. In one we present partial and full profile sets of 3 alternatives, and in the second we push respondents even further by presenting sets of 5 alternatives each. For both case studies we seek to understand the potential impact of conducting choice experiments in mobile web environments by investigating differences in parameter estimates, respondent error, and predictive validity by form factor of survey completion.

CASE STUDY 1: TABLET

Research Design

Our first case study is a web-based survey among tablet owners and intenders. The study was conducted in May of 2012 and consists of six cells defined by design strategy and form factor. The design strategies are partial and full profile, and the form factors are PC, tablet, and mobile. The table below shows the breakdown of completes by cell.

                  PC    Tablet   Mobile
Partial Profile   201   183      163
Full Profile      202   91       164

Partial profile respondents were exposed to 16 choice sets with 3 alternatives, and full profile respondents completed 18 choice sets with 3 alternatives. The design space consisted of 7 attributes with 3 levels and one attribute with 2 levels. Partial profile tasks presented 4 of the 8 attributes in each task. All respondents were given the same set of 6 full profile holdout tasks after the estimation sets, each with 3 alternatives. Below is a typical full profile choice task.

Please indicate which of the following tablets you would be most likely to purchase.
Operating System: Apple | Windows | Android
Memory: 8 GB | 64 GB | 16 GB
Included Cloud Storage (additional at extra cost): 5 GB | 50 GB | None
Price: $199 | $799 | $499
Screen Resolution: High definition display (200 pixels per inch) | Extra-high definition display (300 pixels per inch) | High definition display (200 pixels per inch)
Camera Picture Quality: 5 Megapixels | 0.3 Megapixels | 2 Megapixels
Warranty: 1 Year | 3 Months | 3 Years
Screen Size: 7" | 10" | 5"

Analysis

By way of analysis, the Swait-Louviere test (Swait & Louviere, 1993) is used for parameter and scale equivalence tests on aggregate MNL models estimated in SAS. Sawtooth Software's CBC/HB is used for predictive accuracy and error analysis, estimated at the cell level, i.e., design strategy by form factor.
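For readers unfamiliar with the procedure, the sketch below outlines the two-stage test logic given maximized log-likelihoods from the relevant MNL runs. It is a simplified illustration under stated assumptions; the study's tests were run in SAS, and this is not that code. The lambda-A (λA) statistics reported in the figures that follow appear to correspond to the Stage 1 likelihood-ratio statistic.

```python
# Sketch of the two-stage Swait-Louviere test for pooling two choice data sets.
# Inputs are maximized log-likelihoods from: each group's own MNL, a pooled MNL
# with common betas plus a relative scale parameter mu, and a pooled MNL with
# common betas and mu fixed at 1. k = number of utility parameters (excluding mu).
from scipy.stats import chi2

def swait_louviere(ll_group_a, ll_group_b, ll_pooled_with_scale, ll_pooled_no_scale,
                   k, alpha=0.05):
    # Stage 1 (H1A): are the parameter vectors equal up to scale?
    lr_a = -2.0 * (ll_pooled_with_scale - (ll_group_a + ll_group_b))
    reject_h1a = lr_a > chi2.ppf(1 - alpha, df=k - 1)
    if reject_h1a:
        return {"reject_H1A": True, "reject_H1B": None, "LR_A": lr_a}
    # Stage 2 (H1B): given equal parameters, is the relative scale mu equal to 1?
    lr_b = -2.0 * (ll_pooled_no_scale - ll_pooled_with_scale)
    reject_h1b = lr_b > chi2.ppf(1 - alpha, df=1)
    return {"reject_H1A": False, "reject_H1B": reject_h1b, "LR_A": lr_a, "LR_B": lr_b}
```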
In order to account for demographic differences by device type, data are weighted by age, education, and gender, with the overall combined distribution being used as the target to minimize any distortions introduced through weighting.

Partial Profile Results

[Figure: pairwise Swait-Louviere comparisons for the partial profile cells. Reported values include relative scale parameters m = 1.05 (λA = 156), m = 0.96 (λA = 62), and m = 0.94 (λA = 36), each rejecting H1A, with parameter-agreement R² values of 0.89, 0.81, and 0.94.]

The Swait-Louviere parameter equivalence test is a sequential test of the joint null hypothesis that the scale and parameter vectors of two models are equivalent. In the first step we test for equivalence of parameter estimates allowing scale to vary. If we fail to reject the null hypothesis for this first step, then we move to step 2, where we test for scale equivalence.

In each pairwise test we readily reject the null hypothesis that parameters do not differ beyond scale, indicating that we see significant differences in preferences by device type. Because we reject the null hypothesis in the first stage of the test, we are unable to test for significant differences in scale. However, the relative scale parameters suggest we are not seeing dramatic differences in scale by device type.

The chart below shows all three sets of parameter estimates side-by-side. When we look at the parameter detail, the big differences we see are with brand, price, camera, and screen size. Not surprisingly, brand is most important for tablet responders, with Apple being by far the most preferred option. It is also not too surprising that mobile responders are the least sensitive to screen size. This suggests we are capturing differences one would expect to see, which would result in a more holistic view of the market when the data are combined.

[Chart: aggregate part-worth estimates for PP PC, PP Tablet, and PP Mobile across Brand (Apple, Android, Windows), Hard Drive (8GB, 16GB, 64GB), Cloud Storage (0GB, 5GB, 50GB), Price, Resolution (HD, XHD), Camera (0.3, 2.0, 5.0 Mpx), Warranty (3 Months, 1 Year, 3 Years), and Screen Size (5", 7", 10").]

We used mean absolute error (MAE) and hit rates as measures of predictive accuracy. The HB parameters were tuned to optimize MAE with respect to the six holdout choice tasks by form factor and design strategy. The results for the partial profile design strategy are shown in the table below.

           PP PC   PP Tablet   PP Mobile
Base       201     183         163
MAE        0.052   0.058       0.052
Hit Rate   0.60    0.62        0.63

In terms of hit rate, both mobile and tablet responders are marginally better than PC responders, although not significantly. While tablet responders have a higher MAE, mobile responders are right in line with PC. At least in terms of in-sample predictive accuracy, it appears that mobile responders are at par with their PC counterparts. We next present out-of-sample results in the table below.

Partial profile MAE by prediction utilities:
Holdouts     Random    PC       Tablet    Mobile
PC           0.143     0.052    0.082     0.053
Tablet       0.160     0.103    0.058     0.074
Mobile       0.142     0.060    0.077     0.052
Average                0.082    0.080     0.056

All respondents were presented with the same set of 6 holdout tasks. These tasks were used for out-of-sample predictive accuracy measures, for example by looking at the MAE of PC responders' utilities predicting tablet responders' holdouts. In the table above, the Random column shows the mean absolute deviation from random of the choices, to provide a basis from which to judge the relative improvement of the model. The remainder of the table presents MAE when utilities estimated for the column form factor were used to predict the holdout tasks for the row form factor.
Thus the diagonal is the in-sample MAE and the off-diagonal is out-of-sample. Finally, the Average row is the average cross-form-factor MAE. For example, the average PC MAE of 0.082 is the average MAE using PC-based utilities to predict tablet and mobile holdouts individually.

Mobile outperforms both tablet and PC in every pairwise out-of-sample comparison. In other words, mobile is better at predicting PC than tablet is, and better than PC at predicting tablet holdouts. This can be seen at both the detail and average level. In fact, Mobile is almost as good at predicting PC holdouts as PC responders.

Wrapping up the analysis of our partial profile cells, we compared the distribution of RLH statistics output from Sawtooth Software's CBC/HB to see if there are differences in respondent error by device type. Note that the range of the RLH statistic is from 0 to 1,000 (three implied decimal places), with 1,000 representing no respondent error. That is, when RLH is 1,000, choices are completely deterministic and the model explains the respondent's behavior perfectly. In the case of triples, an RLH of roughly 333 is what one would expect with completely random choices, where the model adds nothing to explain observed choices. Below we chart the cumulative RLH distributions for each form factor.

[Chart: RLH cumulative distributions for PP PC, PP Tablet, and PP Mobile; RLH on the horizontal axis from 0 to 1,000, cumulative percent on the vertical axis.]

Just as with a probability distribution function, the cumulative distribution function (CDF) allows us to visually inspect and compare the first few moments of the underlying distribution to understand any differences in location, scale (variance), or skew. Additionally, the CDF allows us to directly observe percentiles, thereby quantifying where excess mass may be and how much mass that is.

The CDF plots above indicate the partial profile design strategy results in virtually identically distributed respondent error by form factor. As this represents three independent CBC/HB runs (models were estimated separately by form factor), we are assured that this is indeed not an aggregate result.

Full Profile Results

Partial profile results suggest mobile web responders are able to reliably complete smaller tasks on either their tablet or smartphone. As we extend this to the full profile strategy, the limitations of form factor screen size, especially among smartphone responders, may begin to impact the quality of results. We take the same approach to analyzing the full profile strategy as we did with partial profile, first considering parameter and scale equivalence tests.

[Figure: pairwise Swait-Louviere comparisons for the full profile cells. Reported values include relative scale parameters m = 0.96 (λA = 61), m = 1.05 (λA = 99), and m = 0.94 (λA = 138), each rejecting H1A; parameter-agreement R² values visible in the panels include 0.72 and 0.55.]

As with the partial profile results, we again see preferences differing significantly beyond scale. The pairwise tests above show even greater differences in aggregate utility estimates than with partial profile results, as noted by the sharp decline in parameter agreement as measured by the R² fit statistic. However, while we are unable to statistically test for differences in scale, we again see relative scale parameter estimates near 1, suggesting similar levels of error.
[Chart: aggregate part-worth estimates for FP PC, FP Tablet, and FP Mobile across Brand (Apple, Android, Windows), Hard Drive (8GB, 16GB, 64GB), Cloud Storage (0GB, 5GB, 50GB), Price, Resolution (HD, XHD), Camera (0.3, 2.0, 5.0 Mpx), Warranty (3 Months, 1 Year, 3 Years), and Screen Size (5", 7", 10").]

Studying the parameter estimate detail in the chart above, we see a similar story as before, with preferences really differing on brand, price, camera, and screen size. And again, the differences are consistent with what we would expect given the market for tablets, PCs, and smartphones. For all three device types Apple is the preferred tablet brand, which is consistent with the market leader position they enjoy. Android, not having a real presence (if any) in the PC market, is the least preferred for PC and tablet responders, which is again consistent with market realities. Android showing a strong second position among mobile responders is again consistent with market realities, as Android is a strong player in the smartphone market. Tablet responders also being apparently less sensitive to price is also what we would expect given the price premium of Apple's iPad.

In-sample predictive accuracy is presented in the table below, and we again see hit rates for tablet and PC responders on par with one another. However, under the full profile design strategy mobile responders outperform PC responders in terms of both MAE and hit rate, with the latter being significant with 90% confidence for a one-tail test. In terms of MAE, tablet responders outperform both mobile and PC responders. The in-sample predictive accuracy of tablet responders in terms of MAE is most likely the result of brand being such a dominant attribute.

           FP PC   FP Tablet   FP Mobile
Base       202     91          164
MAE        0.044   0.028       0.033
Hit Rate   0.70    0.70        0.76

Looking at out-of-sample predictive accuracy in the table below, we see some interesting results across device type. First off, the random choice MAE is consistent across device types, making the direct comparisons easier in the sense of how well the model improves prediction over the no-information (random) case. The average cross-platform MAE is essentially the same for mobile and PC responders, again suggesting that mobile responders provide results that are on par with PC, at least in terms of predictive validity. Interestingly, and somewhat surprisingly, utilities derived from mobile responders are actually better at predicting PC holdouts than those derived from PC responders.

Full profile MAE by prediction utilities:
Holdouts     Random    PC       Tablet    Mobile
PC           0.163     0.044    0.101     0.038
Tablet       0.163     0.056    0.028     0.074
Mobile       0.166     0.059    0.102     0.033
Average                0.058    0.102     0.056

While PC and mobile responders result in similar predictive ability, tablet responders are much worse at predicting out-of-sample holdouts. On average, tablet responders show almost twice the out-of-sample error of mobile and PC responders, and compared to in-sample accuracy, tablet out-of-sample has nearly 4 times the amount of error. This is again consistent with a dominant attribute or screening among tablet responders. If tablet responder choices are being determined by a dominant attribute, then we would expect to see the mass of the RLH CDF shifted to the right. The cumulative RLH distributions are shown below for each of the three form factors.
However, we do see noticeably greater error variance among tablet responders, with greater mass close to the 1,000 RLH mark as well as more around the 300 range. This suggests that we do indeed see more tablet responders making choices consistent with a single dominating attribute. In order to explore the question of dominated choice behavior further, we calculated the percent of each group who chose in a manner consistent with a non-compensatory or dominant preference structure. A respondent was classified as choosing according to dominating preferences if he/she always chose the option with the same level for a specific attribute as the most preferred. For example, if a respondent always chose the Apple alternative we would say their choices were determined by the brand Apple. This should be a rare occurrence if people are making trade-offs as assumed by the model. The results from this analysis are in the table below. Dominating Preference Partial Profile Full Profile PC Tablet Mobile PC Tablet Mobile 201 183 163 202 91 164 92 44 47 48 33 29 0.23 0.24 0.29 0.24 0.36 0.18 Base Number Percent P-Values* PC 0.770 0.141 Tablet 0.312 * P-Values based on two-tail tests 0.027 0.156 0.001 These results are for dominating preference as described above. In the upper table are the summary statistics by form factor and design strategy. The “Base” row indicates the number of respondents in that cell, “Number” those responding in a manner consistent with dominated preferences, and “Percent” what percent that represents. For example, 23% of PC responders in 75 the partial profile strategy made choices consistent with dominated preferences. The lower table presents p-values associated with the pairwise tests of significance between incidences of dominated choices. Looking at the partial profile responders, this means that the 24% Tablet versus the 23% PC responders has an associated p-value of 77% indicating the two are not significantly different from one another. It should not be surprising that we see no significant differences between form factors for the partial profile in terms of dominated choices because of the nature of the design strategy. In half of the partial profile exercises a dominating attribute will not be shown, forcing respondents to make trade-offs based on attributes lower in their preference structure. However, when we look at the full profile we do see significant differences by form factor. Tablet responders are much more likely to exhibit choices consistent with dominated preferences or screening strategies than either mobile or PC responders who are on par with one another. Differences are both significant with 95% confidence as indicated by the bold p-values in the lower table. CASE STUDY II: SIGNIFICANT OTHER Research Design The second case study was also a web-based survey, this time among people who were either in, or interested in being in, a long-term relationship. We again have an eight attribute design, although we increase the complexity of the experiment by presenting sets of five alternatives per choice task. A typical task is shown below. Which of these five significant others do you think is the best for you? 
Attractiveness Romantic/ Passionate Honesty/Loya lty Funny Intelligence Political Views Religious Views Annual Income Not Very Attractive Not Very Romantic/ Passionate Mostly Trust Very Attractive Somewhat Romantic/ Passionate Can’t Trust Not Very Attractive Very Romantic/ Passionate Can’t Trust Very Funny Sometimes Funny Not Very Smart Strong Republican Religious— Not Christian Very Funny Pretty Smart Strong Democrat Christian Brilliant $15,000 $15,000 Strong Republican No Religion/Sec ular $15,000 Very Attractive Not Very Romantic/ Passionate Completely Trust Not Funny Somewhat Attractive Not Very Romantic/ Passionate Completely Trust Not Funny Not Very Smart Strong Democrat Religious— Not Christian Pretty Smart $40,000 Strong Democrat No Religion/Sec ular $40,000 All attributes other than annual income are 3 level attributes. Annual income is a 5 level attribute ranging from $15,000 to $200,000. There were two cells in this study, one for 76 estimation and one to serve as a holdout sample. The estimation design consists of 5 blocks of 12 tasks each. The holdout sample also completed a block of 12 tasks, and both estimation and holdout samples completed the same three holdout choice tasks. The holdout sample consists of only PC and tablet responders. Given the amount of overlap with 5 alternatives in each task, combined with the amount of space required to present the task, we expect mobile responders to be pushed even harder than with the tablet study. Analysis We again employ aggregate MNL for parameter equivalence tests and Hierarchical Bayes via Sawtooth Software’s CBC/HB for analysis of respondent error and predictive accuracy. However, in contrast to the tablet study we did not set individual quotas for device type, so a lack of reasonable balance results in taking a matching approach to analysis rather than simply weighting by demographic profiles. These matching dimensions are listed in the table below. Matching Dimension Design Block Age Gender Children in House Income Cells 1–5 18–34, 35+ Male/Female Yes/No <$50,000, $50,000+ In each comparison with PC, we used simple random sampling to select PC responders according to the tablet or mobile responder distribution over the above dimensions. For example, if we had 3 mobile responders who completed block 2, were females between 18 and 34 years old with children at home and making more than $50,000 per year, we randomly selected 3 PC responders with the exact same block-demographic profile. The table below shows the breakdown of completes. Matched Profiles Total Estimation Holdout Tablet Mobile 1,860 1,378 482 727 771 Tablet 98 73 25 - 39 Mobile 88 88 0 52 - PC The total completes were comprised of 1,860 PC, 98 tablet, and 88 mobile responders. Of the 1,860 PC responders, 1,378 were part of the estimation cell and 482 were used for the holdout sample. In the estimation sample, 727 PC responders had block-demographic profiles matching at least one tablet responder, and 771 matching at least one mobile responder. The mobile versus tablet responders comparisons are not presented due to the small number of respondents with matching profiles. As previously stated, we used simple random sampling to select a subset of PC responders with block-demographic profile distributions identical to tablet or mobile, depending on comparison. Respondents without matching profiles (in either set) were excluded from the analysis. 
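A minimal sketch of this matched sampling step appears below. It is not the authors' code; the DataFrame and column names are hypothetical, and profiles without enough matching PC responders are simply skipped, consistent with the exclusion rule described above.

import pandas as pd

# Hypothetical respondent-level data: one row per respondent with a 'device'
# column ('PC', 'Tablet', 'Mobile') and the five matching dimensions.
MATCH_DIMS = ['block', 'age_group', 'gender', 'kids_in_house', 'income_group']

def match_pc_sample(resp, target_device, seed=0):
    """Draw a PC subsample whose block-demographic profile distribution is
    identical to that of the tablet or mobile responders."""
    pc = resp[resp['device'] == 'PC']
    target = resp[resp['device'] == target_device]
    draws = []
    for profile, grp in target.groupby(MATCH_DIMS):
        # keep PC responders with exactly the same block-demographic profile
        same = (pc[MATCH_DIMS] == pd.Series(profile, index=MATCH_DIMS)).all(axis=1)
        pool = pc[same]
        if len(pool) >= len(grp):          # otherwise the profile is excluded
            draws.append(pool.sample(n=len(grp), random_state=seed))
    return pd.concat(draws) if draws else pc.iloc[0:0]

Re-running the draw with a different random seed each time produces the repeated iterations described next.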
This process was repeated 1,000 times for parameter equivalent tests using aggregate 77 MNL, and 100 times for predictive accuracy and respondent error using Sawtooth Software’s CBC/HB. Results Parameter and scale equivalence tests were performed at each of the 1,000 iterations described in the analysis section. The charts below summarize the results of the comparison between mobile and PC responders. MNL Parameter Comparison Mobile Relative Scale (m) 25% 72.4% > 1 Mobile R2 = 0.95 S&L Test Results 1,000 Iterations Fail to reject H1A: 19.5% R2 = 0.84 Average PC Fail to reject H1B: 76.9% 0% 0.5 1.0 1.5 2.0 In the first panel the average PC parameter estimates are plotted against the mobile parameter estimates. We see a high degree of alignment overall with an R2 value of 0.95. However, the two outliers point to the presence of a dominating attribute, so we also present the fit for the inner set of lesser important attributes, where we still see strong agreement with an R2 value of 0.84, which is a correlation of over 0.9. The middle panel shows the distribution of the relative scale parameter estimated in the first step of the Swait-Louviere test with mobile showing slightly higher scale about 72% of the time. The right panel above summarizes the test results. Note that if PC and mobile responders were to result in significantly different preferences or scale that we would expect to fail to reject the null hypothesis no more than 5% of the time for tests at the 95% level of confidence. In both H1A (parameter) and H1B (scale) we fail to reject the null hypothesis well in excess of 5% of the time, indicating that we do not see significant differences in preferences or scale between PC and Mobile. Looking at the detailed parameter estimates in the chart below further reinforces similarity of data after controlling for demographics. 78 1.5 1.0 0.5 0.0 -0.5 Avg PC -1.0 Mobile Attractiveness Romantic/ Passionate Honesty/ Loyalty Funny Intelligence Political Views Annual Income No Religion Non-Christian Christian Strong Democrat Swing Voter Strong Repub Brilliant Pretty Smart Not Very Very Sometime Not Funny Complete Mostly Can't Trust Very Somewhat Not Very Very Somewhat Not Very -1.5 Religious Views Comparing tablet and PC parameters and scale we see an even more consistent story. The results are summarized in the charts below. Even when we look at the consistency between parameter estimates on the lesser important inner attributes we have an R2 fit statistic of 0.94, which is a correlation of almost 0.97. Over 90% of the time the relative Tablet scale parameter is greater than 1, suggesting that we may be seeing slightly less respondent error among those completing the survey on a tablet. However, as the test results to the right indicate, neither parameter estimates nor scales differ significantly. MNL Parameter Comparison Tablet Relative Scale (m) 25% 90.7% > 1 Tablet R2 = 0.98 S&L Test Results 1,000 Iterations Fail to reject H1A: 75.1% R2 = 0.94 PC Fail to reject H1B: 51.8% 0% 0.5 1.0 1.5 2.0 Test results for mobile versus tablet also showed no significant differences in preferences or scale. Although as noted earlier, those results are not presented due to available sample sizes and that the story is sufficiently similar as to not add meaningfully to the discussion. Turning to in-sample predictive accuracy, holdout MAE and hit rates are presented in the table below. 
79 Base MAE Tablet 73 0.050 Mobile 88 0.037 * Mean after 100 iterations PC Matched MAE* Hit Rate* 0.039 0.53 0.034 0.54 Hit Rate 0.53 0.53 In the table, the first MAE and Hit Rate columns refer to the results for the row form factor responders. For example, among tablet responders the in-sample holdout MAE is 0.050 and hit rate is 53%. The PC Matched MAE and Hit Rate refer to the average MAE and hit rate over the iterations matching PC completes to row form factor responders. In this case, the average MAE for PC responders matched to tablet is 0.039, with a mean hit rate of 53%. Controlling for demographic composition and sample sizes brings all three very much in line with one another in terms of in-sample predictive accuracy, although tablet responders appear to be the least consistent internally with respect to MAE. Out-of-sample predictive accuracy shows a similar story for mobile compared to PC responders. Once we control for sample size differences and demographic distributions, mobile and PC responders have virtually the same out-of-sample MAE. PC responders when matched to tablet did show a slightly higher out-of-sample MAE than actual tablet responders, although we do not conclude this to be a substantial strike against the form factor. Out-of-sample results are summarized below. MAE Tablet 0.050 Mobile 0.037 * Mean after 100 iterations PC Matched MAE* 0.039 0.034 The results thus far indicate that mobile and tablet responders provide data that is at least on par with PC responders in terms of preferences, scale, and predictive accuracy. To wrap up the results of our significant other case study we look at respondent error as demonstrated with the RLH cumulative distributions in the chart below. RLH Cumulative Distribution 100% Mobile Tablet PC - Mobile PC - Tablet 0% 0 80 200 400 600 800 1000 We again see highly similar distributions of RLH by form factor. The PC matched cumulative distribution curves are based on data from all 100 iterations, which explains the relative smooth shape of the distribution. There is possibly a slight indication that there is less error among the mobile and tablet responders, although we do not view this as substantially different. The slight shift of mass to the right is consistent with relative scale estimates in our parameter equivalence tests, which were not statistically significant. CONCLUSION In both our tablet and significant other studies we see similar results regardless of which form factor the respondent chose to complete the survey. However, the tablet study does indicate the potential for capturing differing preferences by device type of survey completion. Given the context of that study, this finding is not at all surprising, and in fact is encouraging in that we are capturing more of the heterogeneity in preferences we would expect to see in the marketplace. It would be odd if tablet owners did not exhibit different preferences than non-owners given the experience with usage. On the other hand, we observe the same preferences regardless of form factor in the significant other study, which is what we would expect for a non-technical topic unrelated to survey device type. More importantly than preference structures, which we should not expect to converge a priori, both of our studies indicate that the quality of data collected via smartphone is on par with, or even slightly better than, that collected from PC responders. 
In terms of predictive accuracy, both in and out-of-sample, and respondent error, we can be every bit as confident in choice experiments completed in a mobile environment as in a traditional PC environment. Responders who choose to complete surveys in a mobile environment are able to do so reliably, and we should therefore not exclude those folks from choice experiments based on assumptions of the contrary. In light of the potential for capturing different segments in terms of preferences, we should actually welcome the increased diversity offered by presenting choice experiments in different web environments. 81 Joseph White REFERENCES Swait, J., & Louviere, J. (1993). The Role of the Scale Parameter in the Estimation and Comparison of Multinomial Logit Models. Journal of Marketing Research, 30(3), 305–314. 82 USING COMPLEX MODELS TO DRIVE BUSINESS DECISIONS KAREN FULLER HOMEAWAY, INC. KAREN BUROS RADIUS GLOBAL MARKET RESEARCH ABSTRACT HomeAway offers an online marketplace for vacation travelers to find rental properties. Vacation home owners and property managers list rental property on one or more of HomeAway’s websites. The challenge for HomeAway was to design the pricing structure and listing options to better support the needs of owners and to create a better experience for travelers. Ideally, this would also increase revenues per listing. They developed an online questionnaire that looked exactly like the three pages vacation homeowners use to choose the options for their listing(s). This process nearly replicated HomeAway’s existing enrollment process (so much so that some respondents got confused regarding whether they had completed a survey or done the real thing). Nearly 2,500 US-based respondents completed multiple listings (MBC tasks), where the options and pricing varied from task to task. Later, a similar study was conducted in Europe. CBC software was used to generate the experimental design, the questionnaire was custom-built, and the data were analyzed using MBC (Menu-Based Choice) software. The results led to specific recommendations for management, including the use of a tiered pricing structure, additional options, and an increase in the base annual subscription price. After implementing many of the suggestions of the model, HomeAway has experienced greater revenues per listing and the highest renewal rates involving customers choosing the tiered pricing. THE BUSINESS ISSUES HomeAway Inc., located in Austin Texas, is the world’s largest marketplace for vacation home rentals. HomeAway sites represent over 775,000 paid listings for vacation rental homes in 171 countries. Many of these sites recently merged under the HomeAway corporate name. For this reason, subscription configurations could differ markedly from site-to-site. Vacation home owners and property managers list their rental properties on one or more HomeAway sites for an annual fee. The listing typically includes details about the size and location of the property, photos of the home, a map, availability calendar and occasionally a video. Travelers desiring to rent a home scan the listings in their desired area, choose a home, contact and rent directly from the owner and do not pay a fee to HomeAway. HomeAway’s revenues are derived solely from owner subscriptions. Owners and property managers have a desire to enhance their “search position,” ranking higher in the available listings, to attract greater rental income. 
HomeAway desired to create a more uniform approach for listing properties across its websites, enhance the value and ease of listing for the owner, and encourage owners to provide high quality listings while creating additional revenue. 83 THE BUSINESS ISSUE The initial study was undertaken in the US for sites under the names HomeAway.com and VRBO.com. The HomeAway.com annual subscription included a thumbnail photo next to the property listing, 12 photos of the property, a map, availability calendar and a video. Owners could upload additional photos if desired. The search position within the listings was determined by an algorithm rating the “quality” of the listing. The VRBO.com annual subscription included four photos. Owners could pay an additional fee to show more photos which would move their property up in the search results. With the purchase of additional photos came enhancements such as a thumbnail photo, map and video. The business decision entailed evaluating an alternative tiered pricing system tied to the position on the search results (e.g., Bronze, Silver and Gold) versus alternative tiered systems based on numbers of photos. THE STUDY DESIGN The designed study required 15 attributes arrayed as follows using an alternative-specific design through Sawtooth Software’s CBC design module: Five alternative “Basic Listing” options: o Current offer based on photos o Basic offer includes fewer photos with the ability to pay for extra photos and obtain “freebies” (e.g., thumbnail photo) and improve search position o Basic offer includes fewer photos and includes “freebies” (e.g., thumbnail photo). The owner can “buy up” additional photos to improve search position o Basic offer includes many photos but no “freebies.” Pay directly for specific search position and obtain “freebies.” o Basic offer includes many photos and “freebies.” Pay directly for specific search position. Pricing for Basic Offers—five alternatives specific to the “Basic Listing” “Buy Up” Tiers offered specific to Basic Listing offer—3, 7 and 11 tiers Tier prices—3 levels under each approach Options to list on additional HomeAway sites (US only, Worldwide, US/Europe/Worldwide options) Prices to list on additional sites—3 price levels specific to option Other listing options (Directories and others) THE EXERCISE Owners desiring to list a home on the HomeAway site select options they wish to purchase on a series of three screens. For this study the screens were replicated to closely resemble the 84 look and functionality of the sign-up procedure on the website. These screens are shown in the Appendix. Additionally, the search position shown under the alternative offers was customized to the specific market where the rental home was located. In smaller markets “buying up” might put the home in the 10th position out of 50 listings; in other larger markets the same price might only list the home 100th out of 500 listings. As the respondent moved from one screen to the next the “total spend” was shown. The respondent always had the option to return to a prior screen and change the response until the full sequence of three screens was complete. Respondents completed eight tasks of three screens. THE INTERVIEW AND SAMPLE The study was conducted through an online interview in the US in late 2010/early 2011 among current and potential subscribers to the HomeAway service. The full interview ran an average of 25 minutes. 
903 current HomeAway.com subscribers
970 current VRBO.com subscribers
500 prospective subscribers who rent or intend to rent a home to vacationers and do not list on a HomeAway site

Prospective subscribers were recruited from an online panel.

THE DATA

Most critical to the usefulness of the results is an assurance that the responses are realistic, that respondents were not overly fatigued and were engaged in the process. To this end, the median and average "spend" per task are examined and shown in the table below. These results resemble closely the actual spend among current subscribers. Additionally, spend by task did not differ markedly. A large increase/decrease in spend in the early/later tasks might indicate a problem in understanding or responding to the tasks.

The utility values for each of the attribute levels were estimated using Sawtooth Software's HB estimation. Meaningful interactions were embedded into the design. Additional cross-effects were evaluated. To further evaluate data integrity, HB results were run for all eight tasks in total, the first six tasks in total and the last six tasks in total. The results of this exercise indicate that using all eight tasks was viable. Results did not differ in a meaningful way when beginning or ending tasks were dropped from the data runs. The results for several of the attributes are shown in the following charts.

THE DECISION CRITERIA

Two key measures were generated for use by HomeAway in their financial models to implement a pricing strategy—a revenue index and a score representing the appeal of the offer to homeowners. These measures were generated in calculations in an Excel-based simulator, an example of which is shown below:

[Simulator example: for the total sample (N = 1,470) and four property-size subgroups (Small, N = 829; Medium, N = 310; Large, N = 147; Extra Large, N = 184), the output shows the exercise appeal score of the option (84.1, 86.5, 83.1, 78.5, 79.5) and the subgroup revenue index (115.0, 111.7, 116.8, 124.0, 119.7), along with the simulated shares choosing each basic listing type and photo "buy up" level, listings on additional sites, featured listings (1, 3, 6 or 12 months at additional prices), the golf and ski directories, and the special offer option.]

In this simulator, the user can specify the availability of options for the homeowner, pricing specific to each option and the group of respondents to be studied. The appeal measures are indicative of the interest level for that option among homeowners in comparison to the current offer.
The revenue index is a relative measure indicating the degree to which the option studied might generate revenue beyond the “current” offer (Index = 100). The ideal offer would generate the highest appeal while maximizing revenue. The results in the simulator were weighted to reflect the proportion of single and dual site subscribers and potential prospects (new listings) according to current property counts. BUSINESS RECOMMENDATIONS The decision whether to move away from pricing based on the purchase of photos in a listing to an approach based on direct purchase of a listing “tier” was critical for the sites studied as well as other HomeAway sites. Based on these results, HomeAway chose to move to a tiered pricing approach. Both approaches held appeal to homeowners but the tiered approach generated the greater upside revenue potential. Additional research also indicated that the tiered system provided a better traveler experience in navigating the site. While the study evaluated three, seven and eleven tier approaches, HomeAway chose a five tier approach (Classic, Bronze, Silver, Gold and Platinum). Fewer tiers, generally, outperformed higher tier offers. The choice of five tiers offered HomeAway greater flexibility in its offer. Price tiers were implemented in market at $349 (Classic); $449 (Bronze); $599 (Silver); $749 (Gold) and $999 (Platinum). Each contained “value-added” features bundled in the offer to allow for greater price flexibility. These represent a substantive increase in the base annual subscription prices. HomeAway continued to offer cross-sell options and additional listing offers (feature directories, feature listings and other special offers) to generate additional revenue beyond the base listing. SOME LESSONS LEARNED FOR FUTURE MENU-BASED STUDIES Substantial “research” learning was also generated through this early foray into menu-based choice models. We believe that one of the keys to the success of this study was the “strive for realism” in the presentation of the options to respondents. (The task was sufficiently realistic that HomeAway received numerous phone calls from its subscribers asking why their “choices” in the task had not appeared in their listings.) Realism was implemented not only in the “look” of the pages but also in the explanations of the listing positions calculated based on their own listed homes. Also critical to success of any menu-based study is the need to strike a “happy medium” in terms of number of variables studied and overall sample size. o While the flexibility of the approach makes it tempting to the researcher to “include everything,” parsimony pays. Both the task and the analysis can be overwhelming when non-critical variables to the decision are included. 89 Estimation of cross-effects is also challenging. Too many cross-effects quickly result in model over-specification resulting in cross-cancellations of the needed estimations. o Sufficient sample size is likewise critical but too much sample is likewise detrimental. In the study design keep in mind that many “sub-models” are estimated and sample must be sufficient to allow a stable estimation at the individual level. Too much sample however presents major challenges to computing power and ultimately simulation. In simulation it is important to measure from a baseline estimate. This is a research exercise with awareness and other marketing measures not adequately represented. 
Measurement from a baseline levels the playing field for these factors having little known effect providing confidence in the business decisions made. This is still survey research and we expect a degree of over-statement by respondents. Using a “baseline” provides a consistency to the over-statement. IN-MARKET EXPERIENCE HomeAway implemented the recommendations in market for its HomeAway and VRBO businesses. Subsequent to the effort, the study was repeated, in a modified form, for HomeAway European sites. In market adoption of the tiered system exceeded the model predictions. Average revenue per listing increased by roughly 15% over the prior year. Additionally, HomeAway experienced the highest renewal rates among subscribers adopting the tiered system. Brian Sharples, Co-founder and Chief Executive Officer noted: “The tiered pricing research allowed HomeAway to confidently launch tiered pricing to hundreds of thousands of customers in the US and European markets. Our experience in market has been remarkably close to what the research predicted, which was that there would be strong demand for tiered pricing among customers. Not only were we able to provide extra value for our customers but we also generated substantial additional revenue for our business.” 90 Karen Fuller Karen Buros 91 APPENDIX—SCREEN SHOTS FROM THE SURVEY 92 93 94 95 AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY BRENT FULLER MATT MADDEN MICHAEL SMITH THE MODELLERS ABSTRACT There are many ways to handle conjoint attributes with many levels including progressive build tasks, partial profile tasks, tournament tasks and adaptive approaches. When only one attribute has many levels, an additional method is to augment choice data with data from other parts of the survey. We show how this can be accomplished with a standard discrete choice task and a Q-Sort exercise. PROBLEM DEFINITION AND PROPOSED SOLUTION Often, clients come to us with an attribute grid that has one attribute with a large number of levels. Common examples include promotions or messaging attributes. Having too many levels in an attribute can lead to excessive respondent burden, insufficient level exposure and nonintuitive results or reversals. To help solve the issue we can augment discrete choice data with other survey data focused on that attribute. Sources for augmentation could include MaxDiff exercises, Q-Sort and other ranking exercises, rating batteries and other stated preference questions. Modeling both sets of data together allows us to get more information and better estimates for the levels of the large attribute. We hope to find the following in our augmented discrete choice studies: 1st priority—Best estimates of true preference 2nd priority—Better fit with an external comparison 3rd priority—Better holdout hit rates and lower holdout MAEs Approaches like this are fairly well documented. In 2007 Hendrix and Drucker showed how data augmentation can be used on a MaxDiff exercise with a large number of items. Rankings data from a Q-Sort task were added to the MaxDiff information and used to improve the final model estimates. In another paper in 2009 Lattery showed us how incorporating stated preference data as synthetic scenarios to a conjoint study can improve estimation of individual utilities, higher hit rates and more consistent utilities resulted. Our augmenting approach is similar and we present two separate case studies below. Both augment discrete choice data with synthetic scenarios. 
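The exact coding schemes used in the two case studies are described below; purely as a generic illustration of the shared idea (recoding a stated ranking as synthetic choice scenarios appended to the observed choice data), here is a minimal sketch. The function name and the all-pairs coding are our own simplification, not the coding used in either case study.

def synthetic_tasks_from_ranking(ranked_items):
    """Convert a respondent's stated ranking (best first) into synthetic
    pairwise choice tasks in which each item beats every item ranked below it.
    The tuples (shown_items, index_of_chosen) can be appended to the
    respondent's observed choice tasks before utility estimation."""
    tasks = []
    for i, winner in enumerate(ranked_items):
        for loser in ranked_items[i + 1:]:
            tasks.append(([winner, loser], 0))   # the higher-ranked item is "chosen"
    return tasks

# A ranking of four promotion levels yields six synthetic paired comparisons.
print(synthetic_tasks_from_ranking(["A", "B", "C", "D"]))

In practice the appended scenarios are usually sparser than all possible pairs, as the case studies below illustrate.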
The first case study augments a single attribute that has a large number of levels using data from a Q-Sort exercise about that attribute. The second case study augments several binary (included/excluded) attributes with data from a separate scale rating battery of questions. CASE STUDY 1 STRUCTURE We conducted a telecom study with a discrete choice task trading off attributes such as service offering, monthly price, additional fees and contract type. The problematic attribute listed 97 promotion gifts that purchasers would receive for free when signing up for the service. This attribute had 19 levels that the client wanted to test. We were concerned that the experimental design would not give sufficient coverage to all the levels of the promotion attribute and that the discrete choice model would yield nonsensical results. We know from experience that ten levels for one attribute is about the limit of what a respondent can realistically handle in a discrete choice exercise. The final augmented list is shown in Table 1. Table 1. Case Study 1 Augmented Promotion Attribute Levels Augmentation List $100 Gift Card $300 Gift Card $500 Gift Card E-reader Gaming Console 1 Tablet 1 Mini Tablet Tablet 2 Medium Screen TV Small Screen TV Gaming Console 2 HD Headphones Headphones Home Theatre Speakers 3D Blu-Ray Player 12 month Gaming Subscription We built a standard choice task (four alternatives and 12 scenarios) with all attributes. Later in the survey respondents were asked a Q-Sort exercise with levels from the free promotion gift attribute. Our Q-Sort exercise included the questions below to obtain a multi-step ranking. These ranking questions took one and a half to two minutes for respondents to complete. 1) Which of the following gifts is most appealing to you? 2) Of the remaining gifts, please select the next 3 which are most appealing to you. (Select 3 gifts) 3) Of the remaining gifts, which is the least appealing to you? (Select one) 4) Finally, of the remaining gifts, please select the 3 which are least appealing to you (Select 3 gifts) In this way we were able to obtain promotion gift ranks for each respondent. We coded the Q-Sort choices into a discrete choice data framework as a series of separate choices and appended these as extra scenarios within the standard discrete choice data. Based on ranking comparisons the item chosen as the top rank was coded to be chosen compared to all others. The items chosen as “next top three” each beat all the remaining items. The bottom three each beat the last ranked item in pairwise scenarios. We estimated two models to begin with, one standard discrete choice model and a discrete choice with the additional Q-Sort scenarios. 98 CASE STUDY 1 RESULTS As expected, the standard discrete choice model without the Q-Sort augmentation yielded nonsensical results for the promotion gift attribute. Some of the promotions we tested included prepaid gift cards. As seen in Table 2, before integrating the Q-Sort data, we saw odd reversals, for example, the $100 and $300 prepaid cards were preferred over the $500 card on many of the individual level estimates. When the Q-Sort augment was applied to the model the reversals disappeared almost completely. The prepaid card ordering was logical (the $500 card was most preferred) and rank-ordering made sense for other items in the list. Table 2. 
Case Study 1 Summary of Individual Level Reversals

Individual Reversals    DCM Only    DCM + Q-Sort
$100 > $300             59.8%       0.0%
$300 > $500             60.8%       0.8%
$100 > $500             82.3%       0.0%

As a second validation, we assigned approximate MSRP figures to the promotions, figuring they would line up fairly well with preferences. As seen in Figure 1, when plotting the DCM utilities against the MSRP values, the original model had a 29% r-square. After integrating the Q-Sort data, we saw the r-square increase to 58%. Most ill-fitting results were due to premium offerings, where respondents likely would not see MSRP as a good indicator of value.

Figure 1. Case Study 1 Comparison of Average Utilities and MSRP

Priorities one and two mentioned above seem to be met in this case study. The augmented model gave us estimates which we believe are closer to true preference, and the augmented model better matches the external check against MSRP. The third priority of getting improved hit rates and MAEs with the augmented model proved more elusive with this data set. The augmented model did not significantly improve holdout hit rates or MAE (see Table 4). This was a somewhat puzzling result. One explanation is that shares were very low in this study, in the 1% to 5% range. The promotion attribute does not add any additional predictive value because the hit rate is very high and has very little room for improvement. As a validation of this theory we estimated a model without the promotion attribute, completely dropping it from the model. We were still able to obtain 93% holdout hit rates in this model, confirming that the promotion attribute did not add any predictive power to our model. In a discrete choice model like this, people's choices might be largely driven by the top few attributes, yet when the market offerings tie on those attributes, then mid-level attributes (like promotions in this study) will matter more. Our main goal in this study was to get a more sensible and stable read on the promotion attribute, not explicitly to improve the model hit rates. We are also not disappointed that the hit rate and MAE were not improved, because the hit rate and MAE showed a high degree of accuracy in both instances.

Table 3. Case Study 1 Promotion Rank Orderings, Models, MSRP, Q-Sort

Promotion                         MSRP Rank   DCM Only Rank   DCM + Q-Sort Rank   Q-Sort Rank
$500 Gift Card                    1           5               1                   1
Tablet 1                          2           2               3                   3
Home Theatre Speakers             3           7               7                   7
Mini Tablet                       4           1               5                   5
$300 Gift Card                    5           4               2                   2
Medium Screen TV                  6           9               4                   4
Gaming Console 2                  7           15              14                  12
E-reader                          8           12              9                   8
Gaming Console 1                  8           13              12                  11
Tablet 2                          8           6               8                   9
Small Screen TV                   8           8               13                  13
HD Headphones                     8           14              15                  15
3D Blu-Ray Player                 8           11              10                  14
Headphones                        14          10              11                  10
$100 Gift Card                    15          3               6                   6
12 month Gaming Subscription      16          16              16                  16

Table 4. Case Study 1 Comparison of Hit Rates, MAE, and Importances

                      DCM Only    DCM + Q-Sort
Holdout hit rate      93.4%       93.3%
MAE                   0.0243      0.0253
Average Importance    7.3%        14.5%

One problem with augmenting discrete choice data is that it often will artificially inflate the importance of the augmented attribute relative to the non-augmented attributes. Our solution to this problem was to scale back the importances to the original un-augmented model importance at the individual level (a sketch of this rescaling is shown below). We feel that there are also some other possible solutions that could be investigated further. For example, we could apply a scaling parameter to the augmented attribute, choosing it so as to minimize the MAE or maximize the hit rate.
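As an illustration of the individual-level scale-back just described, the sketch below shrinks the augmented attribute's part-worths until its importance (share of total utility range) matches the un-augmented model's importance for that respondent. This is a minimal sketch with hypothetical argument names, not the authors' production code, and it assumes the usual range-based definition of attribute importance.

import numpy as np

def rescale_augmented_attribute(aug_partworths, unaug_importance_pct, other_ranges):
    """Return rescaled part-worths for the augmented attribute so that its
    importance matches the un-augmented model for the same respondent.

    aug_partworths       : level utilities of the augmented attribute (augmented model)
    unaug_importance_pct : that attribute's importance, 0-100, in the un-augmented model
    other_ranges         : utility ranges of all other attributes in the augmented model
    """
    aug_partworths = np.asarray(aug_partworths, dtype=float)
    other_total = float(np.sum(other_ranges))
    current_range = aug_partworths.max() - aug_partworths.min()
    share = unaug_importance_pct / 100.0
    # importance = range / (range + other ranges)  =>  target range = share * other_total / (1 - share)
    target_range = share * other_total / (1.0 - share)
    factor = target_range / current_range if current_range > 0 else 1.0
    # zero-center before scaling so the rescaled levels stay comparable to effects-coded utilities
    return (aug_partworths - aug_partworths.mean()) * factor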
An alternative to augmenting for completely removing all reversals is to constrain at the respondent level using MSRP information. Our constrained model maintained the 93% holdout hit rates and comparable levels of MAE. The constrained model also deflated the average importance of the promotion attribute to 4.8%. We thought it was better to augment in this case study since there are additional trade-offs to be considered besides MSRP. For example, a respondent might value a tablet at $300 but might prefer a product with a lower MSRP because they already own a tablet. CASE STUDY 2 STRUCTURE We conducted a second case study which was also a discrete choice model in the telecom space. The attributes included 16 distinct features that had binary (included/excluded) levels. Other attributes in the study also included annual and monthly fees. After the choice task, respondents were asked about their interest in using each feature (separately) on a 1–10 rating scale. This external task took up to 2 minutes for respondents to complete. If respondents answered 9 or 10 then we added extra scenarios to the regular choice scenarios. Each of these extra scenarios was a binary comparison, each item vs. a “none” alternative. CASE STUDY 2 RESULTS As expected the rank ordering of the augmented task aligned slightly better than what we would intuitively expect (rank orders shown in Table 5). As far as gauging to an external source, this case study was a little bit more difficult than the previous one because we could not assign something as straightforward as MSRP to the features. We looked at the top rated security software products (published by a security software review company) and counted the number of times the features were included in each and ranked them. Figure 2 shows this comparison. Here we feel the need to emphasize key reasons not to constrain to an external source. First, often it is very difficult to find external sources. Second, if an external source is found, it can be difficult to validate and have confidence in. Last, even if there is a valid external source such as MSRP, it still might not make sense to constrain given that there could be other value tradeoffs to consider. Similar to the first case study, we did not see improved hit rates or MAEs in the second case study. Holdout hit rates came out at 98% for both augmented and un-augmented models, and MAEs were not statistically different from each other. We are not concerned with this nonimprovement because of the high degree of accuracy of both the augmented and non-augmented models. 101 Table 5. Case Study 2 Rank Orders Feature Stated Rank DCM Only Rank Order Order DCM + Stated Rank Order Feature 1 1 1 1 Feature 2 2 3 3 Feature 3 3 2 2 Feature 4 4 5 4 Feature 5 5 4 5 Feature 6 6 9 7 Feature 7 7 11 9 Feature 8 8 15 10 Feature 9 9 6 6 Feature 10 10 8 8 Feature 11 11 13 14 Feature 12 12 10 13 Feature 13 13 12 12 Feature 14 14 7 11 Feature 15 15 14 15 Feature 16 16 16 16 Figure 2. Case Study 2 External Comparison DISCUSSION AND CONCLUSIONS Augmenting a choice model with a Q-Sort or ratings battery can improve the model in the following ways. First, the utility values are more logical and fit better with the respondents’ true 102 values for the attribute levels. Second, the utility values have a better fit with external sources of value. It is not a given that holdout hit rates and MAE are improved with augmentation, although we would hope that they would be in most conditions. 
We feel that our hit rates and MAE did not improve in these cases because of the low likelihood of choice in the products we studied and the already high pre-augmentation hit rates. There are tradeoffs to consider when deciding to augment or constrain models. First, there is added respondent burden in asking the additional Q-Sort or other exercise used for augmentation. In our cases the extra information was collected in less than two additional minutes. Second, there is additional modeling and analysis time spent to integrate the augmentation. In our cases the augmented HB models took 15% longer to converge. Third, there is a tendency for the attribute that is augmented to have inflated importances or sensitivities and we suggest scaling the importances by either minimizing MAE or using the un-augmented importances. Lastly, one should consider reliability of external sources to check the augmentation against or to use for constraining. Brent Fuller Michael Smith APPENDIX Figure 3 shows an example of an appended augmented scenario from the first case study. In scenario 13, item 15 was chosen as the highest-ranking item from the Q-Sort exercise. All other attributes for the augmented tasks are coded as 0. Figure 3. Example of un-augmented Coding matrix Scenario Alternative 1 1 1 2 1 3 1 4 … … 12 1 12 2 12 3 12 4 y 1 0 0 0 … 0 0 1 0 tv_2 0 1 0 0 … 0 0 1 -1 tv_3 0 0 0 0 … 0 0 0 -1 tv_4 0 0 1 0 … 1 0 0 -1 tv_5 1 0 0 0 … 0 0 0 -1 tv_6 0 0 0 1 … 0 1 0 -1 promo_1 promo_2 promo_3 promo_4 promo_5 promo_6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … … … … … … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 … … … … … … … … … … promo_19 1 0 0 0 … 0 1 0 0 103 Figure 4. Example of Augmented Coding Matrix Scenario Alternative 1 1 1 2 1 3 1 4 … … 12 1 12 2 12 3 12 4 13 1 13 2 13 3 13 4 13 5 13 6 13 7 13 10 13 12 13 13 13 14 13 15 13 16 13 17 13 18 13 19 y 1 0 0 0 … 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 tv_2 0 1 0 0 … 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 tv_3 0 0 0 0 … 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 tv_4 0 0 1 0 … 1 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 tv_5 1 0 0 0 … 0 0 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 tv_6 0 0 0 1 … 0 1 0 -1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 promo_1 promo_2 promo_3 promo_4 promo_5 promo_6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … … … … … … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … … … … … … … … … … … … … … … … … … … … … … … … … … promo_19 1 0 0 0 … 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 REFERENCES Hendrix, Phil and Drucker, Stuart (2007), “Alternative Approaches to MaxDiff with Large Sets of Disparate Items—Augmented and Tailored MaxDiff” 2007 Sawtooth Software Conference Proceedings, 169–187. Lattery, Kevin (2009), “Coupling Stated Preferences with Conjoint Tasks to Better Estimate Individual-Level Utilities” 2009 Sawtooth Software Conference Proceeding, 171–184. 104 MAXDIFF AUGMENTATION: EFFORT VS. IMPACT URSZULA JONES TNS JING YEH MILLWARD BROWN BACKGROUND In recent years MaxDiff has become a household name in marketing research as it is more and more commonly used to assess the relative performance of various statements, products, or messages. As MaxDiff grows in popularity, it is often called upon to test a large number of items; requiring lengthier surveys in the form of more choice tasks per respondent in order to maintain predictive accuracy. 
Oftentimes MaxDiff scores are used as inputs to additional analyses (e.g., TURF or segmentation), therefore a high level of accuracy for both best/top and worst/bottom attributes is a must. Based on standard rules of thumb for obtaining stable individual-level estimates, the number of choice tasks per respondent becomes very large as the number of items to be tested increases. For example, 40 items requires 24–30 choice tasks per respondent (assuming 4–5 items per task). Yet at the same time our industry, and society in general, is moving at a faster pace with decreasing attention spans; therefore necessitating shorter surveys to maintain respondent engagement. Data quality suffers at 10 to 15 CBC choice tasks per respondent (Tang and Grenville 2010). Researchers, therefore, find ourselves being pulled by two opposing demands— the desire to test larger sets of items and the desire for shorter surveys—and faced with the consequent challenge of balancing predictive accuracy and respondent fatigue. To accommodate such situations, researchers have developed various analytic options, some of which were evaluated in Dr. Ralph Wirth and Anette Wolfrath’s award winning paper “Using MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of a New Approach.” In Express MaxDiff each respondent only evaluates a subset of the larger list of items based on a blocked design and the analysis leverages HB modeling (Wirth and Wolfrath 2012). In Sparse MaxDiff each respondent sees each item less than the rule of thumb of 3 times (Wirth and Wolfrath 2012). In Augmented MaxDiff, MaxDiff is supplemented by Q-Sort informed phantom MaxDiff tasks (Hendrix and Drucker 2007). Augmented MaxDiff was shown to have the best predictive power, but comes at the price of significantly longer questionnaires and complex programming requirements (Wirth and Wolfrath 2012). Thus, these questions still remained: 1. Given the complex programming and additional questionnaire time, is Augmented MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job? 2. If augmentation is valuable, how much is needed? 3. Could augmentation be done using only “best” items or should “worst” items also be included? 105 CASE STUDY AND AUGMENTATION PROCESS To answer these questions regarding Augmented MaxDiff, we used a study of N=676 consumers with chronic pain. The study objectives were to determine the most motivating messages as well as the combination of messages that had the most reach. Augmented MaxDiff marries MaxDiff with Q-Sort. Respondents first go through the MaxDiff section per usual, completing the choice tasks as determined by an experimental design. Afterwards, respondents complete the Q-Sort questions. The Q-Sort questions allow researchers to ascertain additional inferred rankings on the tested items. The Q-Sort inferred rankings are used to create phantom MaxDiff tasks, MaxDiff tasks that weren’t actually asked to respondents, but researchers can infer from other data what the respondents would have selected. The phantom MaxDiff tasks are used to supplement the original MaxDiff tasks and thus create a super-charged CHO file (or response file) for utility estimation. See Figure 1 for an overview of the process. Figure 1: MaxDiff Augmentation Process Overview In our case study, there were 46 total messages tested via Sparse MaxDiff using 13 MaxDiff questions, 4 items per screen, and 26 blocks. Following the MaxDiff section, respondents completed a Q-Sort exercise. 
Q-Sort can be done in a variety of ways. In this case, MaxDiff responses for “most” and “least” were tracked via programming logic and entered into the two Q-Sort sections—one for “most” items and one for “least” items. The first question in the Q-Sort section for “most” items showed respondents the statements they selected as “most” throughout the MaxDiff screens and asked them to choose their top four. The second question in the Q-Sort section for “most” items asked respondents to choose the top one from the top four. The Q-Sort section for “least” items mirrored the Q-Sort section for “most” items. The first question in the Q-sort section for “least” items showed respondents the statements they selected 106 as “least” throughout the MaxDiff screens and asked them to choose their bottom four. The second question in the Q-Sort section for “least” items asked respondents to choose the bottom one from the bottom four. See Figure 2 for a summary of the Q-Sort questions. Figure 2: Summary of Q-Sort Questions From the Q-Sort section on “most” items, for each respondent researchers know: the best item from Q-Sort, the second best items from Q-Sort (of which there are three), and the third tier of best items from Q-Sort (the remaining “most” items not selected in Q-Sort section for “most” items). And from the Q-Sort section on “least” items, for each respondent researchers also know: the worst item from Q-Sort, the second to worst items from Q-Sort (of which there are three), and the third tier of worst items from Q-Sort (the remaining “least” items not selected in Q-Sort section for “least” items). The inferred rankings from this data are custom for each respondent, but at a high level we know: The best item from Q-Sort (1 item) > All other items The second best items from Q-Sort (3 items) > All other items except the best item from Q-Sort The worst item from Q-Sort (1 item) < All other items The second to worst items from Q-Sort (3 items) < All other items except the least item from Q-Sort Using these inferred rankings, supplemental phantom MaxDiff tasks are created. Although respondents were not asked these questions, their answers can be inferred, assuming that respondents would have answered the new questions consistently with the observed questions. 107 Since the MaxDiff selections vary by respondent; the Q-Sort questions, the inferred rankings from Q-Sort, and finally the phantom MaxDiff tasks are also customized to each respondent. From respondents’ Q-Sort answers, we created 18 supplemental phantom MaxDiff tasks as well as the inferred “most” (noted with “M”) and “least” (noted with “L”) responses. See Figure 3 for the phantom MaxDiff tasks we created. Figure 3: Supplemental phantom MaxDiff tasks using both the Q-Sort section on “most” items and the Q-Sort section on “least” items Respondents’ Q-Sort answers are matched to the supplemental Q-Sort-Based MaxDiff tasks (i.e., the phantom MaxDiff tasks) to produce a new design file that merges both the original MaxDiff design and the Q-Sort supplements. Figure 4 illustrates the process in detail. 108 Figure 4: Generating a new design file that merges original MaxDiff with Q-Sort supplements (i.e., phantom MaxDiff tasks) Likewise respondents’ Q-Sort answers are used to infer their responses to the phantom MaxDiff tasks to produce a new response file that merges both responses to the original MaxDiff and the Q-Sort supplements. The new merged design and response files are combined in a supercharged CHO file and used for utility estimation. 
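The exact composition of the 18 phantom tasks is specific to this study (Figure 3). Purely to illustrate the mechanics shown in Figure 4, here is a minimal sketch that builds phantom tasks from a respondent's Q-Sort tiers and fills in the inferred "most" and "least" answers; the dictionary keys and the filler-sampling scheme are our own assumptions, not the authors' design.

import random

def phantom_tasks(qsort, items_per_task=4, n_tasks=6, seed=0):
    """Build phantom MaxDiff tasks from a respondent's Q-Sort tiers.

    qsort is a dict with hypothetical keys:
      'best'         : the single top item from Q-Sort
      'second_best'  : the next three items
      'worst'        : the single bottom item
      'second_worst' : the next three bottom items
      'middle'       : everything else
    Each phantom task pairs the known best and worst items with filler items;
    the inferred answers follow from the Q-Sort rankings (best beats all
    others, worst loses to all others).
    """
    rng = random.Random(seed)
    pool = qsort['second_best'] + qsort['middle'] + qsort['second_worst']
    tasks = []
    for _ in range(n_tasks):
        fillers = rng.sample(pool, items_per_task - 2)
        shown = [qsort['best']] + fillers + [qsort['worst']]
        tasks.append({'items': shown,
                      'most': qsort['best'],     # inferred "most" response
                      'least': qsort['worst']})  # inferred "least" response
    return tasks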
Figure 5 provides an illustration. 109 Figure 5: Generating a new response file that merges original MaxDiff responses with responses to Q-Sort supplements (i.e., phantom MaxDiff tasks) EXPERIMENT Our experiment sought to answer these questions: 1. Given the complex programming and additional questionnaire time, is Augmented MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job? 2. If augmentation is valuable, how much is needed? 3. Could augmentation be done using only “best” items or should “worst” items also be included? To answer these questions, the authors compared the model fit of Sparse MaxDiff with Augmented MaxDiff when two types of Q-Sort augmentations are done: Augmentation of best (or top) items only. Augmentation of both best and worst (or top and bottom) items. Recall that we generated supplemental phantom MaxDiff tasks using both the Q-Sort section for the “most” items and the Q-Sort section for the “least” items (see Figure 3). When testing augmentation including only “best” items we created supplemental phantom MaxDiff tasks using only the Q-Sort section for the “most” items as shown in Figure 6. Again, responses for “most” (noted with “M”) and “least” (noted with “L”) for each phantom tasks can be inferred. 110 Figure 6: Supplemental phantom MaxDiff tasks using only the Q-Sort section on “most” items. We also evaluated the impact of degree of augmentation on model fit by examining MaxDiff Augmentation including 3, 5, 7, 9, and 18 supplemental phantom MaxDiff tasks. FINDINGS As expected, heavier augmentation improves fit. Heavier augmentation appends more Q-Sort data and Q-Sort data is presumably consistent to MaxDiff data. Thus heavier augmentation appends more consistent data and we therefore expected overall respondent consistency measurements to increase. Percent Certainty and RLH are both higher for heavy augmentation compared to Sparse MaxDiff (i.e., no augmentation) and lighter augmentation as shown in Figure 7. 111 Figure 7: Findings from our estimation experiment Surprisingly Best-Only Augmentation outperforms Best-Worst Augmentation even though less information is used with Best-Only Augmentation (Best-Only % Cert=0.85 and RLH=0.81; Best-Worst % Cert=0.80 and RLH=0.76). To further understand this unexpected finding, we did a three-way comparison of (1) Sparse MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst Augmentation (“Q-Sort Best & Worst”). The results showed that at the aggregate level, the story is the same regardless of whether augmentation is used. Spearman’s Rank Order correlation showed strong, positive, and statistically significant correlations between the three indexed MaxDiff scores with rs(46)>=0.986, p=.000 (see Figure 8). Note that for the two cases when augmentation was employed for this test, we used 18 supplemental phantom MaxDiff tasks. 112 Figure 8: Spearmans Rank Order Correlation for indexed MaxDiff scores from (1) Sparse MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst Augmentation (“Q-Sort Best & Worst”) We further compared matches between (1) the top and bottom items based on MaxDiff scores versus (2) top and bottom items based on Q-Sort selections. 
An individual-level comparison was used to show the percent of times there is a match between Q-Sort top four items and MaxDiff top four items as well as between Q-Sort bottom four items and MaxDiff bottom four items. We found that at the respondent level the model is imprecise without augmentation. In particular, the model is less precise at the “best” end compared to the “worst” end (45% match on “best” items vs. 51% match on “worst”). In other words, researchers can be more sure about MaxDiff items that come out at the bottom as compared to MaxDiff items that come out at the top. This finding was consistent with results from a recent ART forum paper by Dyachenko, Naylor, & Allenby (2013). The implication for MaxDiff Augmentation is that augmenting on best items is critical due to the lower precision around those items. CONCLUSIONS As expected, at the respondent level Sparse MaxDiff is more imprecise compared to Sparse MaxDiff with augmentation. However, at the aggregate level the results of Sparse MaxDiff are similar to results with augmentation. Therefore, for studies where aggregate level results are sufficient, Sparse MaxDiff is suitable. But for studies where stability around individual-level estimates is needed, augmenting Sparse MaxDiff is recommended. By augmenting on “best” items only, researchers can get a better return with a shorter questionnaire and less complex programming compared to augmenting on both “best” and “worst” items. MaxDiff results were shown to be less accurate at the “best” end and augmentation on “best” items improved fit. Best-only augmentation requires a shorter questionnaire compared to augmentation on both “best” and “worst” items. Heavy augmenting whether augmenting on “best” only or both “best” and “worst” is critical when other analyses (e.g., TURF, clustering) are required. The accuracy of utilities estimated 113 from heavy augmentation was superior to the accuracy of utilities estimated from lighter augmentation. Finally, if questionnaire real estate allows, obtain additional information from which the augmentation can benefit. For example, in the Q-Sort exercise, instead of asking respondents to select the top/bottom four and then the top/bottom 1; ask respondents to rank the top/bottom 4. This additional ranking information allows more flexibility in creating the item combinations for the supplemental phantom MaxDiff tasks, and we therefore hypothesize, better utility estimates. Urszula Jones Jing Yeh REFERENCES Dyachenko, Tatiana; Naylor, Rebecca; & Allenby, Greg (2013), “Models of Sequential Evaluation in Best-Worst Choice Tasks,” 2013 Sawtooth Software Conference Proceedings. Tang, Jane; Grenville, Andrew (2012), “How Many Questions Should You Ask in CBC Studies?—Revisited Again,” 2010 Sawtooth Software Conference Proceedings, 217–232. Wirth, Ralph; Wolfrath, Anette (2012), “Using MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of a New Approach,” 2012 Sawtooth Software Conference Proceedings, 99–109. 114 WHEN U = BX IS NOT ENOUGH: MODELING DIMINISHING RETURNS AMONG CORRELATED CONJOINT ATTRIBUTES KEVIN LATTERY MARITZ RESEARCH 1. INTRODUCTION The utility of a conjoint alternative is typically assumed to be a simple sum of the betas for each attribute. Formally, we define utility U = βx. This assumption is very reasonable and robust. But there are cases where U = βx is not enough, it is just too simple. 
One example of this, which is the focus of this paper, is when the utilities of attributes in a conjoint study are correlated. Correlated attributes can arise in many ways, but one of the most prevalent ways is with binary yes/no attributes. An example of binary attributes appears below, where the attributes marked with an x mean that the benefit is being offered, and a blank means it is not.

[Table: an example of binary benefit attributes. Three programs (Program 1, 2, 3) are described by two non-binary attributes (each at Level 1-3) and five binary benefits: discounts on equipment purchases; access to online equipment reviews by other members; early access to new equipment (information, trial, purchase, etc.); custom fittings; and members-only logo balls, tees, tags, etc. An x indicates the benefit is offered.]

If the binary attributes are correlated, then adding more benefits does not give a consistent or steady lift to utility. The result is that the standard U = βx model tends to over-predict interest when there are more binary features (Program 1 above) and under-predict product concepts that have very few of the features (Program 3 above). This can be a critical issue when running simulations. One common marketing issue is whether a lot of smaller cheaper benefits can compensate for more significant deficiencies. Clients will simulate a baseline first product, and then a second cheaper product that cuts back on substantial features. They will then see if they can compensate for the second product's shortcomings by adding lots of smaller benefits. We have seen many cases where simulations using the standard model suggest a client can improve a baseline product by offering a very inferior product with a bunch of smaller benefits. And even though this is good news to the client who would love this to be true, it seems highly dubious even to them.

The chart below shows the results of one research study. The x-axis is the relative number of binary benefits shown in an alternative vs. the other alternative in its task. This conjoint study showed two alternatives with only 1 to 3 binary benefits, so we simply take the difference in the number of benefits shown. So showing alternative A with 1 benefit and alternative B with 3 benefits, we have (1 - 3) = -2 and (3 - 1) = 2. The difference in binary attributes is -2 and +2. The vertical axis is the corresponding error, Predicted Share - Observed Share, for the in-sample tasks. This kind of systematic bias, where we overstate the share as we add more benefits, is extremely problematic when it occurs.

While this paper focuses on binary attributes, the problem occurs in other contexts, and the new model we propose can be used in those as well. This paper also discusses some of the issues involved when we change the standard conjoint model to something other than U = βx. In particular, we discuss how it is not enough to simply change the functional form when running HB. Changing the utility function also requires other significant changes in how we estimate parameters.

2. LEARNING FROM CORRELATED ALTERNATIVES

There is not much discussion in the conjoint literature about correlated attributes, but there is a large body of work on correlated alternatives. Formally, we talk about the "Independence of Irrelevant Alternatives" (IIA) assumption, where the basic logit model assumes the alternatives are not correlated. The familiar blue bus/red bus example is an example of correlated alternatives.
Introducing a new bus that is the same as the old bus with a different color creates correlation between the two alternatives: red bus and blue bus are correlated. The result is parallel to what we described above for correlated attributes: the model over-predicts how many people ride the bus. In fact, if we keep adding the same bus with different colors (yellow, green, fuchsia, etc.), the basic logit model will eventually predict that nearly everyone rides the bus.

One of the classic solutions to correlated alternatives is Nested Logit. With Nested Logit, alternatives within a nest are correlated with one another. In the case of buses, one would group the various colors of buses together into one nest of correlated alternatives. The degree of correlation is specified by an additional λ parameter. When λ = 1, there is no correlation among the alternatives in a nest, and as λ goes to 0, the correlation increases. One solves for λ, given the specific data. In the case of red bus/blue bus, we expect λ near 0.

The general model we propose follows the mathematical structure of nested logit, but instead of applying the mathematical structure to alternatives, we apply it to attributes. This means we think of the attributes as being in nests, grouping together those attributes that are correlated, and measuring the degree of that correlation by introducing an additional λ parameter. In our case, we are working at the respondent level, and each respondent will have their own λ value. For some respondents, a nest of attributes may have little correlation, while for others the attributes may be highly correlated.

Recall that the general formulation for a nested logit, where the nest is composed of alternatives 1 through n with non-exponentiated utilities U1 . . . Un, is:

e^U = [(e^U1)^(1/λ) + (e^U2)^(1/λ) + ... + (e^Un)^(1/λ)]^λ

where 0 < λ <= 1, 1 - λ can be interpreted as the correlation among the alternatives, and U gives the overall utility of the nest as a whole, which is then used in the multinomial logit in the usual way.

We can adapt this formulation to create a nest of attributes in the following way. Consider a defined nest of n attributes with betas B1 . . . Bn, and corresponding indicators x1 . . . xn (each indicator being 1 if the attribute applies to an alternative, and zero otherwise). Then the standard utility for the group of attributes is:

U = (x1B1) + (x2B2) + ... + (xnBn)

and the new nested-like formulation is:

U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ

Each set of attributes that is grouped together in a nest would have its own λ parameter. For instance, we might group attributes 1–5 together in one nest, and attributes 6–9 in another. We would then compute the new utility U for each nest separately and add them. We could even employ hierarchies of nests as we do with nested logit. One important caveat is that each Bi >= 0. We are talking about diminishing returns, so we are assuming each beta is positive, but the return may diminish when it is with other attributes in the same nest. (If the betas were not all positive, the idea of diminishing returns would not make any sense.)

This new formulation has several excellent properties, which it shares with nested logit.

1) When λ = 1, the formulation reduces to the standard utility. This is also its maximum value.

2) As λ shrinks towards 0, the total utility also shrinks. At the limit of λ = 0, the total utility is just the utility of the attribute in the nest with the greatest value.
3) The range of utility values runs from the utility of the single best attribute to the simple sum of the attributes. We are shrinking the total utility from the simple sum down to an amount that is at least the single highest attribute.

4) Adding a new attribute (changing xi from 0 to 1) will always add some amount of utility (though it could be very close to 0). It is not possible to have reversals, where adding an item shrinks the total utility down lower than it was before adding the item.

5) The amount of shrinkage depends upon the size of the betas and their relative differences. Betas that are half as large will shrink half as much. In addition, the amount of shrinkage also depends upon their relative size: three nearly equal betas will shrink differently than three where one is much smaller.

Before turning to the results using this nested attribute formulation, I want to briefly describe and evaluate some other methods that have been proposed to deal with the problem of diminishing returns among correlated attributes.

3. ALTERNATIVE SOLUTIONS TO DIMINISHING RETURNS

There are three alternative methods I will discuss here, which other practitioners have shared with me in their efforts to solve the problem of diminishing returns. Of course, there are likely other methods as well. The advantage of all the alternative methods discussed here is that they do not require one to change the underlying functional form, so they can be easily implemented, with some caveats.

3.1 Recoding into One Attribute with Many Levels

One method for modeling a set of binary attributes is to code them into one attribute with many levels. For example, 3 binary attributes A, B, C can be coded into one 7-level attribute: A, B, C, AB, AC, BC, ABC (or an 8-level attribute, if "none" is included as a level). This gives complete control over how much value each combination of benefits has. To implement it, you would also need to add six constraints:

– AB > A
– AB > B
– AC > A
– AC > C
– ABC > AB
– ABC > AC

With only 2–3 attributes, this method works fine. It adds quite a few more parameters to estimate, but is very doable. When we get to 4 binary attributes, we need to make a 15- or 16-level attribute with 28 constraints. That is more levels and constraints than most will feel comfortable including in the model. With 5 attributes, things are definitely out of hand, requiring an attribute with 31 or 32 levels and many constraints. While this method gives one complete control over every possible combination, it is not very parsimonious, even with 4 binary attributes. My recommendation is to use this approach only when you have 2–3 binary attributes.

3.2 Adding Interaction Effects

A second approach is to add interaction terms. For example, if one is combining attributes A and B, the resulting utility is A + B - AB, where the last term is an interaction value that represents the diminishing returns. When AB is 0, there is no diminishing. It is not clear how this works with more attributes. With 3 binary attributes A, B, C, the total utility might be A + B + C - (AB + AC + BC + ABC). But notice that we have lots of interactions to subtract. When we get to 4 attributes, it is even more problematic, as we have 11 interactions to subtract out. If we use all possible interactions, we wind up with one degree of freedom per level, just as in the many-leveled attribute approach, so this method is like that of 3.1 in terms of the number of parameters.
One might reduce the number of parameters by only subtracting the pairwise interactions, imposing a little more structure on the problem. Even pairs alone can still be quite a few interactions, however. The bigger problem with this method is how to make the constraints work. We must be careful not to subtract out too much for any specific combination. Otherwise, it will appear that we add a benefit, and actually reduce the utility, a “reversal.” This means trying to set up very complex constraints. So we need constraints like (A + B + C) > (AB + AC + BC + ABC). Each interaction term will be involved in many such constraints. Such a complex constraint structure will most likely have a negative impact on estimation. I see no value in this approach, and recommend the previous recoding option in 3.1 as much easier to implement. 3.3 Adding One or More Variables for Number of Items The basic idea of this approach is like that of subtracting interactions, but using a general term for the interactions based on the number of binary items, rather than estimating each interaction separately. So for two attributes A and B, the total utility would be: A + B - k, and for 3 attributes it might be A + B + C - 2k In this case, k represents the amount we are subtracting for each additional benefit. With 5 benefits, we would be subtracting 4k. The modeling involves adding an additional variable to the design code. This new variable counts the number of binary attributes in the alternative. So an alternative with 5 binary attributes has a new variable with a value of 4 (we use number of attributes - 1). The value k can be estimated just like any other beta in the model, constrained for the proper sign. One can even generalize the approach by making it non-linear, perhaps adding a squared term for instance. Or we can make the amount subtracted vary for each number of benefits. So for two attributes: A + B - m, and for 3 attributes it might be A+ B+ C - n In this very general case, we model a specific value for each number of binary attributes, with constraints applied so that we subtract out more as the number of attributes increases (so n > m above). The number of additional parameters to estimate is the range of the number of attributes in an alternative (minus 1). So if the design shows alternatives ranging from 1 binary benefit to 9, we would estimate 8 additional parameters, ordinally constrained. I think this approach is very clever, and in some cases it may work. The general problem I have with it is that it can create reversals. Take the simple case of two binary attributes A + B - k. In that case, k should be smaller than A or B. The same applies to different pairs of attributes. In the case of C + D - k, we want k to also be less than C or D. In general we want k to be less than the minimum value of all the binary attributes. Things get even more complicated when we turn 119 to 3 attributes. In the end, I’m not certain these constraints can be resolved. One may just have to accept reversals. In the results section, I will show the degree to which reversals occurred using this method in my data sets. This is not to say that the method would never be an improvement over the base model, but one should be forewarned about the possibility of reversals. Another disadvantage of this method is that shrinkage is purely a function of the number of items. Given 3 items, one will always subtract out the same value regardless of what those items are. 
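To make the mechanics of this count-based adjustment concrete, here is a minimal sketch (Python; the betas and the penalty k are hypothetical, not estimates from any study reported here) of subtracting k for each benefit beyond the first, along with the kind of reversal check discussed above.

```python
import numpy as np

# Hypothetical positive betas for five binary benefits.
betas = {"A": 0.9, "B": 0.6, "C": 0.4, "D": 0.3, "E": 0.1}

# Hypothetical penalty per additional benefit (in a real model this would be
# estimated like any other beta, constrained to be non-negative).
k = 0.25

def utility_count_adjusted(items):
    """Sum the betas, then subtract k for each benefit beyond the first."""
    total = sum(betas[i] for i in items)
    return total - k * max(len(items) - 1, 0)

# Reversal check: does adding a small benefit lower the bundle's utility?
before = utility_count_adjusted(["A", "B"])
after = utility_count_adjusted(["A", "B", "E"])
print(before, after, "reversal" if after < before else "no reversal")
```

Because k exceeds the smallest beta in this toy example, adding benefit E lowers the bundle's total utility, which is exactly the reversal problem described above.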
In contrast, the nested logit shrinks the total utility based on the actual betas. So with the nested formulation, 3 items with low utility are shrunk less than 3 items with high utility, in proportion to their scale, based on their relative size to each other, and the shrinkage cannot exceed the original effects. Basing shrinkage on the actual betas is most likely better than shrinkage based solely on the number of items. That said, making shrinkage a function of the betas does impose more difficulty in estimation—a topic we turn to in the next section. 4. ESTIMATION OF NESTED ATTRIBUTE FORMULATION (LATENT CLASS ENSEMBLES AND HB) Recall that the proposed formula for a nest of attributes 1 through n, with corresponding betas Bi and indicators xi is: U = [(x1B1)1/λ + (x2B2)1/λ + ... + (xnBn)1/λ ]λ , where λ ϵ (0,1], Bi >=0 In the results section, we estimated this model using Latent Class Ensembles. This method was developed by Kevin Lattery and presented in detail in an AMA ART Forum 2013 paper, “Improving Latent Class Conjoint Predictions With Nearly Optimal Cluster Ensembles.” As we mention there, the algorithm is run using custom SAS IML code. We give a brief description of the method below. Latent Class Ensembles is an extension of Latent Class. But rather than one Latent Class solution, we develop many Latent Class solutions. This is easy to do because Latent Class is subject to local optima. By using many different starting points and relaxing the convergence criteria (max LL/min LL > .999 for last 10 iterations), we create many different latent class solutions. These different solutions form an ensemble. Each member of the ensemble (i.e., each specific latent class solution) gives a prediction for each respondent. We average over those predictions (across the ensemble) to get the final prediction for each respondent. So if we generate 30 different Latent Class solutions, we get 30 different predictions for a specific respondent, and we average across those 30 predictions. This method significantly improves the predictive power versus a single Latent Class solution. The primary reason for using Latent Class Ensembles is that the model was easier to estimate. One of the problems with nested logit is that it is subject to local maxima. So it is common to estimate nested logits in three steps: 1) Assume λ =1, and estimate the betas, 2) Keep betas fixed from 1), and estimate λ, 3) Using betas and λ from 2) as starting points to estimate both simultaneously. We employed this same three-step process in the latent class ensemble approach. So each iteration of the latent class algorithm required three regressions for one segment, rather than one. But other than that, we simply needed to change the function and do logistic regression. One 120 additional wrinkle is that we estimated 1/ λ rather than λ. We then set the constraint on 1/ λ to range from 1 to 5. Using the inverse broadens the range of what is being estimated, and makes it easier to estimate than the equivalent λ ranging from .2 to 1. In some cases I have capped 1/ λ at 10 rather than 5. If one allows 1/ λ to get too large, it can cause calculation overflow errors. At .2, the level of shrinkage is quite close to that one would find at the limit of 0. Our intent was to also use the nested formulation within HB. That however proved to be more difficult. Simply changing the functional form and constraining 1/ λ did not work at all. 
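To be concrete about the functional form being estimated, the following is a minimal sketch (Python; the betas, indicators, and λ value are hypothetical placeholders, not estimates from any study in this paper) of the nested-attribute utility for a single nest, written in terms of 1/λ as in the estimation described above.

```python
import numpy as np

def nest_utility(betas, x, inv_lam):
    """Nested-attribute utility for one nest:
    U = [sum_i (x_i * B_i)^(1/lambda)]^lambda, with all B_i >= 0 and
    inv_lam = 1/lambda constrained to a range such as [1, 5]."""
    betas = np.asarray(betas, dtype=float)
    x = np.asarray(x, dtype=float)
    terms = (x * betas) ** inv_lam          # (x_i * B_i)^(1/lambda)
    return terms.sum() ** (1.0 / inv_lam)   # raise the sum back to lambda

# Hypothetical nest of four binary benefits, all present in an alternative.
betas = [0.8, 0.6, 0.5, 0.3]
x = [1, 1, 1, 1]

print(nest_utility(betas, x, inv_lam=1.0))  # lambda = 1: the simple sum, 2.2
print(nest_utility(betas, x, inv_lam=5.0))  # lambda = 0.2: about 0.85, near the largest beta (0.8)
```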
HB did not converge, and any result we tried performed more poorly than the traditional non-nested formula. We tried estimation first using our Maritz specific HB package in SAS and then also using the R package ChoiceModelR. Both of these had to be modified for the custom function. One problem is that λ is not a typical parameter like the betas. Instead, λ is something we apply to the betas in exponential form. So within HB, λ should not be included in the covariance matrix of betas. In addition, we should draw from a separate distribution for λ, including a separate jumping factor in the Gibbs sampler. This then raises the question of order of estimation, especially given the local optima that nested functions often have. My recommendation is as follows: 1) Estimate the betas without λ (assume λ =1) 2) Estimate 1/λ using its own 1-attribute variance matrix, assuming fixed betas from above 3) Estimate betas and 1/λ starting with covariance matrix of betas in 1) and 1/λ covariance matrix in 2) We have not actually programmed this estimation method, but offer it as a suggestion for those who would like to use HB. It is clear to us that one needs, at the very least, separate draws and covariance matrices for betas and 1/λ. The three steps above recognize that and adopt the conservative null hypothesis that λ will be 1, moving away from 1 only if the data support that. The three steps parallel the three steps taken in the Latent Class ensembles, and the procedure of nested logit more generally: sequential, then simultaneous. My recommendation would be to first do step 1 across many MCMC iterations until convergence, estimating betas only for several thousand iterations. Then do step 2 across many MCMC iterations, estimating λ. Finally, do step 3, which differs from the first two steps in that each iteration estimates new betas and then new λs. 5. EMPIRICAL RESULTS Earlier we showed the results for the standard utility HB model. When we apply the nested attribute utility function (estimated with Latent Class ensembles), we get the following result. We show the new nested method just to the right of the standard method that was shown in the original figure: 121 Nested Model Standard Utility The nested formulation not only shows a reduction in error, but removes most of the number of item bias. The error is now almost evenly distributed across the relative number of items. The data set above is our strongest case study of the presence of item bias. We believe there a few reasons for this, which make this data set somewhat unique. First, this study only measured binary attributes. Most conjoint studies have a mixture of binary and non-binary attributes. When there is error in the non-binary attributes, the number of attributes bias is less apparent. A second reason is that this study showed exactly two alternatives. This makes it easy to plot the difference in the number of items in each task. If alternative A showed 4 binary attributes and alternative B showed 3 binary attributes, the difference is +1 for A and -1 for B. In most of our studies we have more than 2 alternatives. So the relative number of binary attributes is more complicated. The next case study had a more complex design, with 3 alternatives and both binary and non-binary attributes. 
Here is a sample task:

[Sample task: three loyalty programs compared on Points Earned, Room Upgrades, Food & Beverage Discount, Frequent Flyer Miles, and a set of binary Additional Benefits.
– Loyalty Program 1: 100 pts per $100; no room upgrades; no food & beverage discount; 1,500 miles per stay; five additional benefits, including early check-in and extended check-out.
– Loyalty Program 2: 200 pts per $100; 2 room upgrades per year; 10% off food & beverage; 500 miles per stay; two additional benefits.
– Loyalty Program 3: 400 pts per $100; unlimited room upgrades; 15% off food & beverage; 1,000 miles per stay; one additional benefit.
The additional benefits shown across the three programs included: 2 one-time guest passes to an airport lounge; complimentary welcome gift/snack; turndown service; gift shop discount; priority check-in and express check-out; complimentary breakfast; and no blackout dates for reward nights.]

In this case, we estimated the relative number of binary attributes by subtracting from the mean. The mean number of binary attributes in the example above is (5 + 2 + 1)/3 = 2.67. We then subtract the mean from the number of binary attributes in each alternative. So alternative 1 has 5 - 2.67 = 2.33 more binary attributes than average. Alternative 2 has 2 - 2.67 = -.67, and alternative 3 has 1 - 2.67 = -1.67. Note that these calculations are not used in the modeling; we are only doing them to facilitate plotting the bias in the results.

The chart below shows the results for each in-sample task. The correlation is .33, much higher than it should be; ideally, there should be no relationship here. The slope of the line is 2.4%, which means that each time we add an attribute we are overstating share by an average of 2.4%. This assumes a 3-alternative scenario; with two alternatives that slope would likely be higher. Clearly there is systematic bias. It is not as clean as the first example, in part because we have noise from other attributes as well. Moreover, we should note that this is in-sample share. Ideally, one would like out-of-sample tasks, which we expect would show even more bias than above. If you see the bias in-sample, it is even more likely to be found out-of-sample.

Using the new model, the slope of the line relating relative number of attributes to error in share is only 0.9%. This is much closer to the desired value of 0. We also substantially reduced the mean absolute error, from 5.8% to 2.6%.

The total sample size for this case study was 1,300. 24.7% of those respondents had a λ of 1, meaning no diminishing returns for them. 34.7% had a λ of .2, the smallest used in this model. The remaining 40.7% had a median λ of .41, so the λ values skewed lower. The chart below shows the cumulative percentage of λ values, flipped at the median. Clearly there were many λ values significantly different from 1.

[Figure: Cumulative Distribution of Estimated λ Values (flipped at the median, so cumulative from each end to the middle)]

We also fit a model using the method discussed in section 3.3, adding a variable equal to the number of binary attributes minus 1. This offered a much smaller improvement: the slope of the bias line was reduced from 2.4% to 1.8%. So if one is looking for a simple approach that may offer some help in reducing systematic error, the section 3.3 approach may be worth considering. However, we remain skeptical of it because of the problem of reversals, which we discuss below.

One drawback to the number-of-binary-attributes method is reversals. These occur when the constant we are subtracting is greater than the beta for an attribute. Properly fixing these reversals is extremely difficult. One attribute, Gift Shop Discount, showed a reversal at the aggregate level. On its own, it added benefit, but when you added it to any other existing binary attribute the predicted result was lower.
Clearly, this would be counterintuitive and would need to be fixed in such a model. It turns out that every one of the 24 binary attributes had a reversal for some respondents, using the method in 3.3. In addition to "Gift Shop Discounts," two other attributes had reversals for over 40% of the respondents. This is clearly an undesirable level of reversals, and could show up as reversals in aggregate calculations for some subgroups. For this reason, we remain skeptical of using this method. The nested logit formulation never produces reversals.

6. NOTES ON THE DESIGN OF BINARY ATTRIBUTES

While the focus of this paper has been on the modeling of binary attributes, there are a few crucial comments I need to make about the experimental design with sets of binary attributes. After presenting initial versions of this paper, one person exclaimed that given a set of, say, 8 binary attributes, they like to show the same number of binary attributes in each alternative and task. For example, the respondent will always see 3 binary attributes in each alternative and task. This makes the screen look perfectly square, and gives the respondent a consistent experience. Of course, it also means you won't notice any number-of-attributes bias. But it doesn't mean the bias is not there; we are just in denial.

In addition to avoiding (but not solving) the issue, showing the same number of binary attributes is typically an absolutely terrible idea. If one always shows 3 binary attributes, then your model can only predict what happens if each alternative has the same number of binary attributes. The design is inadequate to make any other predictions. You really don't know anything about 2 binary vs. 3 binary vs. 4 binary, etc. To explain this point, consider the following task with 3 alternatives, where each alternative shows two non-binary attributes and three of the five binary attributes, with the betas shown in the cells:

– Program 1: Non-Binary1 = 1.0, Non-Binary2 = 1.5; binary betas 2.0, 0.5, 0.2. Utility = 5.2; exponentiated U = 181.3; probability = 81.1%.
– Program 2: Non-Binary1 = -3.0, Non-Binary2 = 1.5; binary betas 2.0, 1.2, 0.8. Utility = 2.5; exponentiated U = 12.2; probability = 5.5%.
– Program 3: Non-Binary1 = 2.0, Non-Binary2 = -0.5; binary betas 0.5, 0.2, 1.2. Utility = 3.4; exponentiated U = 30.0; probability = 13.4%.

Now what happens if we add 2 to each binary attribute? We get the values below:

– Program 1: Non-Binary1 = 1.0, Non-Binary2 = 1.5; binary betas 4.0, 2.5, 2.2. Utility = 11.2; exponentiated U = 73,130; probability = 81.1%.
– Program 2: Non-Binary1 = -3.0, Non-Binary2 = 1.5; binary betas 4.0, 3.2, 2.8. Utility = 8.5; exponentiated U = 4,915; probability = 5.5%.
– Program 3: Non-Binary1 = 2.0, Non-Binary2 = -0.5; binary betas 2.5, 2.2, 3.2. Utility = 9.4; exponentiated U = 12,088; probability = 13.4%.

The final predictions are identical! In fact, we can add any constant value to each binary attribute and get the same prediction. So given our design, each binary attribute is really Beta + k, where k can be anything. In general this is really bad, because the choice of k makes a big difference when you vary the number of attributes. Comparing a 2-item alternative with a 5-item alternative results in a 3k difference in utilities, which leads to a big difference in estimated shares or probabilities. The value of k matters in the simulations, but the design won't let us estimate it! The only time this design is not problematic is when you keep the number of binary attributes the same in every alternative for your simulations as well as in the design.

To do any modeling on how a mixed number of binary attributes works, the design must also have a mixed number of binary attributes. This is true entirely independently of whether we're trying to account for diminishing returns. In general, we recommend you have as much variability in your design as you want to simulate. Sometimes this is not reasonable, but it should always be the goal.
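A quick numerical check of this identification problem, using the betas from the tables above (Python; the share calculation is just the standard logit): when every alternative carries the same number of binary attributes, adding any constant k to every binary beta leaves the predicted shares unchanged.

```python
import numpy as np

def shares(utilities):
    expU = np.exp(utilities)
    return expU / expU.sum()

# Each alternative: two non-binary part-worths plus exactly three binary betas,
# mirroring the tables above.
non_binary = np.array([1.0 + 1.5, -3.0 + 1.5, 2.0 + -0.5])
binary = np.array([
    [2.0, 0.5, 0.2],
    [2.0, 1.2, 0.8],
    [0.5, 0.2, 1.2],
])

for k in (0.0, 2.0, 10.0):
    U = non_binary + (binary + k).sum(axis=1)  # add k to every binary beta
    print(k, np.round(shares(U), 3))           # shares are identical for every k
```

Every value of k shifts each alternative's utility by the same 3k, so the shares stay at 81.1%, 5.5%, and 13.4% no matter what k is.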
One of the more common mistakes conjoint designers make is that they will let each binary attribute be randomly present or absent. So if one has 8 binary attributes, most of the alternatives have about 4 binary attributes. Very few, if any tasks, will show 1 binary attribute vs. 8 binary attributes. But it is important that the design show this range of 1 vs. 8 if we are to accurately model this extreme difference. My recommendation is to create designs where the range of binary attributes varies from extreme to small, and evenly distribute tasks across this range. For example, each respondent might see 3 tasks with extreme differences, 3 with somewhat extreme, 3 moderate, and 3 with minimal differences. The key is to get different levels of contrast in the relative number of binary attributes if one wants to detect and model those kinds of differences. 7. CONCLUSION The correlation of effects among attributes in a conjoint study means that our standard U = βx may not be adequate. One way to deal with that correlation is to borrow the formulation of nested logit, which was meant to deal with correlated alternatives. More specifically, the utility for a nest of attributes 1 . . . n was defined as: U = [(x1B1)1/λ + (x2B2)1/λ + ... + (xnBn)1/λ ]λ Employing that formulation in HB has its challenges, as one should specify separate draws and covariance matrices for the betas and λ. We recommended a three stage approach for HB estimation: 1) Estimate betas without λ (assume λ =1) 2) Estimate 1/λ using its own 1 attribute variance matrix, assuming fixed betas from above 3) Estimate betas and 1/λ starting with covariance matrix of betas in 1) and 1/λ covariance matrix in 2) While we did not validate the nested logit formulation in HB, we did test a similar three step approach using the methodology of latent class ensembles: sequential estimation of betas, followed by estimation of λ and then a simultaneous estimation. The nested attribute model estimated this way significantly reduced the overstatement of share that happens with the standard model when adding correlated binary attributes. In this paper we have only discussed grouping all of the binary attributes together in one nest. But of course, that assumption is most likely too simplistic. In truth, some of the binary attributes may be correlated with each other, while others are not. Our next steps are to work on ways to determine which attributes belong together, and whether that might vary by respondent. There are several possible ways one might group attributes together. Judgment is one method, based on one’s conceptual understanding of which attributes belong together. In some cases, our judgment may even structure the survey design, as in this example: 127 Another possibility is empirical testing of different nesting models, much like the way one tests different path diagrams in PLS or SEM. We also plan to test rating scales. By asking respondents to rate the desirability of attributes we can use correlation matrices and variable clustering to determine which attributes should be put into a nest with one another. As noted in the paper, one might even create hierarchies of nests, as one does with nested logits. We have just begun to develop the possibilities of nested correlated attributes, and welcome further exploration. Kevin Lattery 128 RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE? 
A VARIANCE DECOMPOSITION OF HB UTILITIES

KEITH CHRZAN, AARON HILL
SAWTOOTH SOFTWARE

INTRODUCTION

Common practice among applied marketing researchers is to analyze discrete choice experiments using Hierarchical Bayesian multinomial logit (HB-MNL). HB-MNL analysis produces a set of respondent-specific part-worth utilities which researchers hope reflect heterogeneity of preferences among their samples of respondents. Unfortunately, two other potential sources of heterogeneity, version effects and utility magnitude, could create preference-irrelevant differences in part-worth utilities among respondents. Using data from nine commercial choice experiments (provided by six generous colleagues) and from a carefully constructed data set using artificial respondents, we seek to quantify the relative contribution of version effects and utility magnitude to heterogeneity of part-worth utilities. Any heterogeneity left unexplained by these two extraneous sources may represent real differences in preferences among respondents.

BACKGROUND

Anecdotal evidence and group discussions at previous Sawtooth Software conferences have identified version effects and differences in utility magnitudes as sources of preference-irrelevant heterogeneity among respondents.

Version effects occur when respondents who receive different versions or blocks of choice questions end up with different utilities. One of the authors recalls stumbling upon this by accident, using HB utilities in a segmentation study only to find that the needs-based segment a respondent joined depended to a statistically significant extent on the version of the conjoint experiment she received. We repeated these analyses on the first five data sets made available to us. First we ran cluster analysis on the HB utilities using the ensembles analysis in CCEA. Crosstabulating the resulting cluster assignments by version numbers, we found significant χ² statistics in three of the five data sets. Moreover, of the 89 utilities estimated in the five models, analysis of variance identified significant F statistics for 31 of them, meaning they differed by version.

Respondents may have larger or smaller utilities as they answer more or less consistently, or as their individual utility model fits their response data more or less well. The first data set made available to us produced a fairly typical finding: the respondent with the most extreme utilities (as measured by their standard deviation) had part-worths 4.07 times larger than those for the respondent with the least extreme utilities. As measures of respondent consistency, differences in utility magnitude are one (but not the only) manifestation of the logit scale parameter (Ben-Akiva and Lerman 1985). Recognizing that the response error quantified by the scale parameter can affect utilities in a couple of different ways, we will refer to this effect as one of utility magnitude rather than of scale.

With evidence of both version effects and utility magnitude effects, we undertook this research to quantify how much of the between-respondent differences in utilities they explain.
If it turns out that these explain a large portion of the differences we see in utilities across respondents, we will have to question how useful it is to have respondent-level utilities:

– If much of the heterogeneity we observe owes to version effects, perhaps we should keep our experiments small enough (or our questionnaire long enough) to have just a single version of our questionnaire, to prevent version effects; of course, this raises the question of which one version would be the best one to use.
– If differences in respondent consistency explain much of our heterogeneity, then perhaps we should avoid HB models and their illusory view of preference heterogeneity.

If these two factors explain very small portions of observed utility heterogeneity, however, then we can be more confident that our HB analyses are measuring preference heterogeneity.

VARIANCE DECOMPOSITION OF COMMERCIAL DATA SETS

In order to decompose respondent heterogeneity into its components we needed data sets with more than a single version, but with enough respondents per version to give us statistical power to detect version effects. Kevin Lattery (Maritz Research), Jane Tang (Vision Critical), Dick McCullough (MACRO Consulting), Andrew Elder (Illuminas), and two anonymous contributors generously provided disguised data sets that fit our requirements. The nine studies differed in terms of their designs and the number of versions and of respondents they contained:

– Study 1: 5 x 4^3 x 3^2 design; 10 versions; 810 respondents
– Study 2: 23 items/14 quints; 8 versions; 1,000 respondents
– Study 3: 15 items/10 quads; 6 versions; 2,002 respondents
– Study 4: 12 items/9 quads; 8 versions; 1,624 respondents
– Study 5: 47 items/24 sextuples; 4 versions; 450 respondents
– Study 6: 90 items/46 sextuples; 6 versions; 527 respondents
– Study 7: 8 x 4^2 x 3^4 x 2 design; 2 versions; 5,701 respondents
– Study 8: 5^2 x 4^2 x 3^2 design; 10 versions; 148 respondents
– Study 9: 4^2 x 3^5 x 2 + NONE design; 6 versions; 1,454 respondents

For each study we ran the HB-MNL using CBC/HB software and settings we expected practitioners would use in commercial applications. For example, we ran a single HB analysis across all versions, not a separate HB model for each version; estimating a separate covariance matrix for each version could create a lot of additional noise in the utilities. Or again, we ran the CBC/HB model without using version as a covariate: using the covariate could (and did) exaggerate the importance of version effects in a way a typical commercial user would not.

With the utilities and version numbers in hand we next needed a measure of utility magnitude. We tried four different measures and decided to use the standard deviation of a respondent's part-worth utilities. This measure had the highest correlations with the other measures; all were highly correlated with one another, however, so the results reported below are not sensitive to this decision.

For variance decomposition we considered a variety of methods, but they all came back with very similar results. In the end we opted to use analysis of covariance (ANCOVA). ANCOVA quantifies the contribution of categorical (version) and continuous (utility magnitude) predictors to a dependent variable (in this case a given part-worth utility). We ran the ANCOVA for all utilities in a given study and then report the average contribution of version and utility magnitude across the utilities as the result for the given study. The variance unexplained by either version effects or utility magnitude may owe to respondent heterogeneity of preferences or to other sources of heterogeneity not included in the analysis.
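To illustrate the shape of this decomposition, here is a minimal sketch (Python with pandas and statsmodels; the data are simulated placeholders, and it uses a plain Type II ANCOVA rather than the averaging-over-orderings variant described next) of regressing each part-worth on version and utility magnitude and averaging the variance shares across part-worths.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical input: one row per respondent, HB part-worth columns u0..u7
# plus the questionnaire version each respondent received.
rng = np.random.default_rng(0)
n, k = 500, 8
utils = pd.DataFrame(rng.normal(size=(n, k)), columns=[f"u{i}" for i in range(k)])
utils["version"] = rng.integers(1, 7, size=n)  # six versions
utils["magnitude"] = utils[[f"u{i}" for i in range(k)]].std(axis=1)

# For each part-worth, fit utility ~ version (categorical) + magnitude
# (continuous) and express each sum of squares as a share of the total.
rows = []
for i in range(k):
    fit = smf.ols(f"u{i} ~ C(version) + magnitude", data=utils).fit()
    ss = sm.stats.anova_lm(fit, typ=2)["sum_sq"]
    rows.append(ss / ss.sum())

# Average the decomposition across the part-worths, one result per study.
print(pd.DataFrame(rows).mean())
```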
Because of an overlap in the explained variance from the two sources, we ran an averaging-over-orderings ANCOVA to distribute the overlapping variance.

So what happened? The following table relates results for the nine studies and for the average across them, showing the percentage of variance in utilities attributable to utility magnitude, to version, and to the residual:

– Study 1 (5 x 4^3 x 3^2; 10 versions; n = 810): magnitude 6.2%, version 4.9%, residual 88.9%
– Study 2 (23 items/14 quints; 8 versions; n = 1,000): magnitude 8.0%, version 1.1%, residual 90.9%
– Study 3 (15 items/10 quads; 6 versions; n = 2,002): magnitude 6.4%, version 0.1%, residual 93.5%
– Study 4 (12 items/9 quads; 8 versions; n = 1,624): magnitude 10.3%, version 1.1%, residual 88.6%
– Study 5 (47 items/24 sextuples; 4 versions; n = 450): magnitude 7.9%, version 0.7%, residual 91.4%
– Study 6 (90 items/46 sextuples; 6 versions; n = 527): magnitude 17.0%, version 1.2%, residual 81.8%
– Study 7 (8 x 4^2 x 3^4 x 2; 2 versions; n = 5,701): magnitude 7.3%, version 0%, residual 92.7%
– Study 8 (5^2 x 4^2 x 3^2; 10 versions; n = 148): magnitude 27.8%, version 3.3%, residual 68.9%
– Study 9 (4^2 x 3^5 x 2 + NONE; 6 versions; n = 1,454): magnitude 6.7%, version 1.1%, residual 92.2%
– Mean across studies: magnitude 10.8%, version 1.4%, residual 87.8%

On average, the bulk of heterogeneity owes neither to version effects nor to utility magnitude effects. In other words, up to almost 90% of measured heterogeneity may reflect preference heterogeneity among respondents.

Version has a very small effect on utilities: statistically significant, perhaps, but small. Version effects account for just under 2% of utility heterogeneity on average and never more than 5% in any study.

Heterogeneity owing to utility magnitudes explains more, about 11% on average and as high as 27.8% in one of the studies (the subject of that study should have been highly engaging to respondents, who were all in the market for a newsy technology product; the same level of engagement should also have been present in study 7, however, based on a very high response rate in that study and the importance respondents would have placed on the topic). In other words, in some studies a quarter or more of observed heterogeneity may owe to the magnitude effects that reflect only differences in respondent consistency. Clearly magnitude effects merit attention, and we should consider removing them when appropriate (e.g., for reporting of utilities but not in simulators) through scaling like Sawtooth Software's zero-centered diffs.

IS THERE A MECHANICAL SOURCE OF THE VERSION EFFECT?

It could be that the version effect is mechanical; that is, something about the separate versions itself causes the effect. To test this we looked at whether the version effect occurs among artificial respondents with additive, main-effects utilities: if it does, then the effect is mechanical and not psychological. Rather than start from scratch and create artificial respondents who might have patterns of preference heterogeneity unlike those of humans, we started with utility data from human respondents. For a worst-case analysis, we used the study with the highest contribution from the version effect, study 1 above. We massaged these utilities gently, standardizing each respondent's utilities so that all respondents had the same magnitude of utilities (the same standard deviation across utilities) as the average of human respondents in study 1. In doing this we retain a realistic human-generated pattern of heterogeneity, while at the same time removing utility magnitude as an explanation for any heterogeneity. Then we added logit choice rule-consistent, independently, identically distributed (i.i.d.) errors from a Gumbel distribution to the total utility of each alternative in each choice set for each respondent. We had our artificial respondents choose the highest-utility alternative in each choice set, and the choice sets constituted the same versions the human respondents received in Study 1. Finally, we ran HB-MNL to generate utilities.
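The core of that artificial-respondent exercise can be sketched as follows (Python; the part-worths and designs here are simulated placeholders rather than the disguised Study 1 data, so this shows the procedure, not the study itself).

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder stand-ins for Study 1: respondent part-worths (standardized to
# a common magnitude) and a 0/1 design for each alternative in each task.
n_resp, n_utils, n_tasks, n_alts = 810, 20, 12, 3
partworths = rng.normal(size=(n_resp, n_utils))
partworths = partworths / partworths.std(axis=1, keepdims=True)  # equal magnitude
designs = rng.integers(0, 2, size=(n_resp, n_tasks, n_alts, n_utils))

# Each artificial respondent chooses the alternative with the highest total
# utility after adding i.i.d. Gumbel (logit-consistent) error.
choices = np.empty((n_resp, n_tasks), dtype=int)
for r in range(n_resp):
    for t in range(n_tasks):
        total_u = designs[r, t] @ partworths[r] + rng.gumbel(size=n_alts)
        choices[r, t] = total_u.argmax()

# 'choices' would then be fed back through HB-MNL to re-estimate utilities
# and rerun the variance decomposition.
```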
When we ran the decomposition described above on the utilities the version effects virtually disappeared, with the variance explained by version effects falling from 4.9% of observed heterogeneity for human respondents to 0.04% among the artificial respondents. Thus a mechanical source does not explain the version effect. DO CONTEXT EFFECTS EXPLAIN THE VERSION EFFECT? Perhaps context effects could explain the version effect. Perhaps some respondents see some particular levels or level combinations early in their survey and they answer the remainder of the survey differently than do respondents who saw different initial questions. At the conference the suggestion came up that we could investigate this by checking whether the version effect differs in studies wherein choice tasks appear in random orders versus those in which choice tasks occur in a fixed order. We went back to our nine studies and found that three of them showed tasks in a random order while six had their tasks asked in a fixed order within each version. It turned out that studies with randomized task orders had slightly smaller version effects (explaining 0.7% of observed heterogeneity) than those with tasks asked in a constant order in each version (explaining 1.9% of observed heterogeneity). So the order in which respondents see choice sets may explain part of the small version effect we observe. Knowing this we can randomize question order within versions to obscure, but not remove, the version effect. CONCLUSIONS/DISCUSSION The effect of version on utilities is often significant but invariably small. It accounts for an average under 2% of total observed heterogeneity in our nine empirical studies and in no case as high as 5%. A much larger source of variance in utilities owes to magnitude differences, an average of almost 11% in our studies and nearly 28% in one of them. The two effects together 132 account for about 12% of total observed heterogeneity in the nine data sets we investigated and thus do not by themselves explain more than a fraction of the total heterogeneity we observe. We would like to say that this means that the unexplained 88% of variance in utilities constitutes true preference heterogeneity among respondents but we conclude more cautiously that true preference heterogeneity or as yet unidentified sources of variance explain the remaining 88%. Some of our simulation work points to another possible culprit: differences in respondent consistency (reflected in the logit scale factor) may have both a fixed and a random component to their effect on respondent-level utilities. Our analysis, using utility magnitude as a proxy, attempts to pull out the fixed component of the effect respondent consistency has on utilities but it does not capture the random effect. This turns out to be a large enough topic in its own right to take us beyond the scope of this paper. We think it best left as an investigation for another day. Keith Chrzan Aaron Hill REFERENCES Ben-Akiva, M. and S.R. Lerman (1985) Discrete Choice Analysis: Theory and Application to Travel Demand. Cambridge: MIT. 
133 FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO CREATE VALUE KARLAN WITT DEB PLOSKONKA CAMBIA INFORMATION GROUP OVERVIEW Many organizations collect data across multiple stakeholders (e.g., customers, shareholders, investors) and sources (e.g., media, research, investors), intending to use this information to become more nimble and efficient, able to keep pace with rapidly changing markets while proactively managing business risks. In reality, companies often find themselves drowning in data, struggling to uncover relevant insights or common themes that can inform their business and media strategy. The proliferation of social media, and the accompanying volume of news and information, has added a layer of complexity to the already overwhelming repository of data that organizations are mining. So the data is “big” long before the promise of “Big Data” leverage and insights are seen. Attempts to analyze data streams individually often fail to uncover the most relevant insights, as the resulting information typically remains in its original silo rather than informing the broader organization. When consuming social media, companies are best served by fusing social media data with data from other sources to uncover cohesive themes and outcomes that can be used to drive value across the organization. USING DATA TO UNLOCK VALUE IN AN ORGANIZATION Companies typically move through three distinct stages of data integration as they try to use data to unlock value in their organization: Stages of Data Integration Stage 1: Instill Appetite for Data This early stage often begins with executives saying, “We need to track everything important to our business.” In Stage 1, companies develop an appetite for data, but face a number of challenges putting the data to work. The richness and immediacy of social media feedback provides opportunity for organizations to quickly identify risks to brand health, opportunities for customer engagement, and other sources of value creation, making the incorporation of social 135 media data an important component of business intelligence. However, these explosively rich digital channels often leave companies drowning in data. With over 300 million Facebook users, 200 million bloggers, and 25 million Twitter users, companies can no longer control the flood of information circulating about their firms. Translating data into metrics can provide some insights, but organizations often struggle with: Interacting with the tools set up for accessing these new data streams; Gaining a clear understanding of what the metrics mean, what a strong or weak performance on each metric would look like, and what impact they might represent for the business; and Managing the volume of incoming information to identify what should be escalated for potential action. To address these challenges, data streams are often driven into silos of expertise within an organization. In Stage 1 firms, the silos, such as Social Media, Traditional Media, Paid Media Data, and Research, seldom work closely together to bring all the organization’s data to full Big Data potential. Instead, the periodic metrics from a silo such as social media (number of blogs, tweets, news mentions, etc.) published by silos lack richness and recommended action, leading organizations to realize they don’t need data, they need information. 
Stage 2: Translate Data into Information In Stage 2, companies often engage external expertise, particularly in social media, in an effort to understand how best to digest and prioritize information. In outsourcing the data consumption and analysis, organizations may see reports with norms and additional contextual data, but still no “answers.” In addition, though an organization may have a more experienced person monitoring and condensing information, the data has effectively become even more siloed, with research and media expertise typically having limited interaction. Most organizations remain in this stage of data use, utilizing dashboards and publishing reports across the organization, but never truly understanding how organizations faced with an influx of data can intelligently consume information to drive value creation. Stage 3: Create Value Organizations that move to Stage 3 fuse social media metrics with other research data streams to fully inform their media strategy and other actions they might take. The following approach outlines four steps a company can take to help turn their data into actionable information. 1. Identify Key Drivers of Desired Business Outcomes Using Market Research: In today’s globally competitive environment, company perception is impacted by a broad variety of stakeholders, each with a different focus on what matters. In this first step, organizations can identify the Key Performance Indicators (attributes, value propositions, core brand values, etc.) that are most important to each stakeholder audience. From this, derived importance scores are calculated to quantify the importance of each KPI. As seen below, research indicates different topics are more or less important to different stakeholders. This becomes important as we merge this data into the social media 136 monitoring effort. As spikes occur in social media for various topics, a differential response by stakeholder can occur, informed by the derived importance from research. Example: Topic Importance by Stakeholder 2. Identify Thresholds: Once the importance scores for KPI’s are established, organizations must identify thresholds for each audience segment. Alternative Approaches for Setting “Alert Thresholds” 1. Set a predefined threshold, e.g., 10,000 mentions warrants attention. 2. Compare to past data. Set a multiplier of the absolute number of mentions over a given time period, e.g., 2x, 3x; alternatively, use the Poisson test to identify a statistical departure from the expected. 3. Compare to past data. If the distribution is normal, set alerts for a certain number of standard deviations from the mean, if not use the 90th or 95th percentile of historical data for similar time periods. 4. Model historical data with time series analyses to account for trends and/or seasonality in the data. From this step, organizations should also determine the sensitivity of alerts based on individual client preferences. 137 “High” Sensitivity Alerts are for any time there is a spike in volume that may or may not have significant business impacts. “Low” Sensitivity Alerts are only for extreme situations which will likely have business implications. 138 3. Media Monitoring: Spikes in media coverage are often event-driven, and by coding incoming media by topic or theme, clients can be cued to which topics may be spiking at any given moment. In the example below, digital media streams from Forums, Facebook, YouTube, News, Twitter, and Blogs are monitored, with a specific focus on two key topics of interest. 
Example: Media Monitoring by Topic Similarly, information can be monitored by topic on a dashboard, with topic volume spikes triggering alerts delivered according to their importance, as in the example below: 139 Although media monitoring in this instance is set up by topic, fusing research data with social media data allows the importance to each stakeholder group to be identified, as in the example below: 4. Model Impact on KPI’s: Following an event, organizations can model the impact on the Key Performance Indicators by audience. Analyzing pre- and post-event measures across KPIs will help to determine the magnitude of any impact, and will also help uncover if a specific sub-group within an audience was impacted more severely than others. Identifying the key attributes most impacted within the most affected sub-group would suggest a course of action that enables an organization to prioritize its resources. CASE STUDY A 2012 study conducted by Cambia Information Group investigated the potential impact of social media across an array of KPIs among several stakeholders of interest for a major retailer. Cambia had been conducting primary research for this client for a number of years among these audiences and was able to supplement this research with media performance data across the social media spectrum. Step 1: Identify Key Drivers of Desired Business Outcomes Using Market Research This first step looks only at the research data. The research data used is from an on-going tracking study that incorporates both perceptual and experiential-type attributes about this company and top competitors. From the research data, we find that different topics are more or less important to different stakeholders. The data shown in this case study is real, although the company and attribute labels have been abstracted to ensure the learnings the company gained through this analysis remain a competitive advantage. The chart below shows the beta values across the top KPIs across all stakeholder groups. If you are designing the research as a benchmark specifically to inform this type of intelligence system, the attributes should be those things that can impact a firm’s success. For example, if you are designing attributes for a conjoint study and it is about cars, and all competitors in the study have equivalent safety records, safety may not make the list of attributes to be included. Choice of color or 2- vs. 4-doors might be included. 140 However, when you examine the automotive industry over time with various recalls and elements that have caused significant brand damage among car manufacturers, safety would be a key element. We recommend that safety would be a variable that is included in a study informing a company about the media and what topics have the ability to impact the firm (especially negatively). Car color and 2- vs. 4-doors would not be included in this type of study. Just looking at the red-to-green conditional formatting of the betas on the table below, it is immediately clear that the importance values vary within and between key stakeholder groups. Step 2: Identify Thresholds for Topics The second step moves from working with the research data to working with the social media data sources. The goal of this step is to develop a quantitative measure of social media activity and set a threshold that, once reached, can trigger a notification to the client. This is a multi-step process. For this data set, we used the following methodology: 1. 
Relevant topics are identified to parallel those chosen for the research. Social media monitoring tools have the ability to listen to the various channels and tag the pieces that fall under each topic. 2. Distributions of volume by day by topic were studied for their characteristics. While some topics maintained a normal distribution, others did not. Given the lack of normality, an approach involving mean and standard deviation was discarded. 3. Setting a pre-defined threshold was discarded as too difficult to support in a client context. Additionally, the threshold would need to take into account the increasing volume of social media over time. 4. Time series analyses would have been intriguing and are an extension to be considered down the road, although it requires specialized software and a client who is comfortable with advanced modeling. 5. Distributions by day evidence a “heartbeat” pattern—low on the weekends, higher on the weekdays. Thresholds need to account for this differential. Individuals clearly engage in 141 social media behavior while at work—or more generously, perhaps as part of their role at work. 6. For an approach that a client could readily explain to others, it was settled on referencing the median of the non-normal distribution, and from this point forward taking the 90th or 95th percentile and flagging it for alerts. Given that some clients may wish to be notified more often, an 85th percentile is also offered. Cuts were identified for both weekday and weekend, and to account for the rise in social media volume, reference no more than the past 6 months of data. So the thresholds were set up with the high (85th percentile) and low (95th percentile) sensitivity levels. For our client, a manager-level team member received all high-sensitivity notifications (high sensitivity as described earlier means it detects every small movement and sends a notice). Senior staff received only low sensitivity notices. Since these were only the top 5% of all events, these were hypothesized to carry potential business implications. Step 3: Media Monitoring This step is where the research analytics and the social media analytics come together. The attributes measured in the research and for which we have importance values by attribute by stakeholder group are aligned with the topics that have been set up in the social media monitoring tools. Because we were dealing with multiple topics across multiple stakeholder groups, we chose to extend the online reporting for the survey data to provide a way to view the status of all topics, in addition to the email alerts. 142 Looking at this by topic (topics are equivalent to attributes, like “safety”), shows the current status of each, such as the view below: The specific way these are shown can be incorporated in many different ways depending on the systems available to you. Step 4: Model Impact on KPI’s Although media monitoring is set up by topic, the application of the research data allows the importance to each stakeholder group to be identified. As an example, a spike in social media about low employee wages or other fair labor practice violations might have a negative impact on the employee audience, a moderate impact on customers, and no impact on the other stakeholder groups. 143 Using this data, our client was very rapidly able to respond to an event that popped in the media that was unexpected. 
The company was immediately able to identify which audiences would potentially be impacted by the media coverage, and focus available resources in messaging to the right audiences. CONCLUSION: This engagement enabled the firm to take three key steps: 1. Identify problem areas or issues, and directly engage key stakeholder groups, in this case the Voting Public and their Investors; 2. Understand the window of opportunity (time lag) between negative coverage and its impact on the organization’s brand health; 3. Predict the brand health impact from social media channels, and affect that impact through messaging of their own. Potential Extensions for inclusion of alert or risk ratings include: 1. Provide “share of conversation” alerts, 2. Develop alert ratings within segments, 3. Incorporate potential media exposure to calculate risk ratio for each stakeholder group to any particular published item, 4. Expand the model which includes the impact of each source, 5. A firm’s overall strategy. 144 Karlan Witt 145 BRAND IMAGERY MEASUREMENT: ASSESSMENT OF CURRENT PRACTICE AND A NEW APPROACH1 PAUL RICHARD MCCULLOUGH MACRO CONSULTING, INC. EXECUTIVE SUMMARY Brand imagery research is an important and common component of market research programs. Traditional approaches, e.g., ratings scales, have serious limitations and may even sometimes be misleading. MaxDiff scaling adequately addresses the major problems associated with traditional scaling methods, but historically has had, within the context of brand imagery measurement, at least two serious limitations of its own. Until recently, MaxDiff scores were comparable only to items within the MaxDiff exercise. Traditional MaxDiff scores are relative, not absolute. Dual Response (anchored)MaxDiff has substantially reduced this first problem but may have done so at the price of reintroducing scale usage bias. The second problem remains: MaxDiff exercises that span a reasonable number of brands and brand imagery statements often take too long to complete. The purpose of this paper is to review the practice and limitations of traditional brand measurement techniques and to suggest a novel application of Dual Response MaxDiff that provides a superior brand imagery measurement methodology that increases inter-item discrimination and predictive validity and eliminates both brand halo and scale usage bias. INTRODUCTION Brand imagery research is an important and common component of most market research programs. Understanding the strengths and weaknesses of a brand, as well as its competitors, is fundamental to any marketing strategy. Ideally, any brand imagery analysis would not only include a brand profile, providing an accurate comparison across brands, attributes and respondents, but also an understanding of brand drivers or hot buttons. Any brand imagery measurement methodology should, at a minimum, provide the following: Discrimination between attributes, for a given brand (inter-attribute comparisons) Discrimination between respondents or segments, for a given brand and attribute (inter-respondent comparisons) Good fitting choice or purchase interest model to identify brand drivers (predictive validity) With traditional approaches to brand imagery measurement, there are typically three interdependent issues to address: 1 Minimal variance across items, i.e., flat responses Brand halo The author wishes to thank Survey Sampling International for generously donating a portion of the sample used in this paper. 
147 Scale usage bias Resulting data are typically non-discriminating, highly correlated and potentially misleading. With high collinearity, regression coefficients may actually have reversed signs, leading to absurd conclusions, e.g., lower quality increases purchase interest. While scale usage bias may theoretically be removed via modeling, there is reason to suspect any analytic attempt to remove brand halo since brand halo and real brand perceptions are typically confounded. That is, it is difficult to know whether a respondent’s high rating of Brand A on perceived quality, for example, is due to brand halo, scale usage bias or actual perception. Thus, the ideal brand imagery measurement technique will exclude brand halo at the data collection stage rather than attempt to correct for it at the analytic stage. Similarly, the ideal brand imagery measurement technique will eliminate scale usage bias at the data collection stage as well. While the problems with traditional measurement techniques are well known, they continue to be widely used in practice. Familiarity and simplicity are, no doubt, appealing benefits of these techniques. Among the various methods used historically, the literature suggests that comparative scales may be slightly superior. An example of a comparative scale is below: Some alternative techniques have also garnered attention: MaxDiff scaling, method of paired comparisons (MPC) and Q-sort. With the exception of Dual Response MaxDiff (DR MD), these techniques all involve relative measures rather than absolute. MaxDiff scaling, MPC and Q-sort all are scale-free (no scale usage bias), potentially have no brand halo2 and demonstrate more discriminating power than more traditional measuring techniques. MPC is a special case of MaxDiff; as it has been shown to be slightly less effective it will not be further discussed separately. With MaxDiff scaling, the respondent is shown a random subset of items and asked to pick which he/she most agrees with and which he/she least agrees with. The respondent is then shown several more subsets of items. A typical MaxDiff question is shown below: 2 These techniques do not contain brand halo effects if and only if the brand imagery measures are collected for each brand separately rather than pooled. 148 Traditional MaxDiff3 With Q-sorting, the respondent is asked to place into a series of “buckets” a set of items, or brand image attributes, from best describes the brand to least describes the brand. The number of items in each bucket roughly approximates a normal distribution. Thus, for 25 items, the number of items per bucket might be: First bucket Second bucket Third bucket Fourth bucket Fifth bucket Sixth bucket Seventh bucket 1 item 2 items 5 items 9 items 5 items 2 items 1 item MaxDiff and Q-sorting adequately address two of the major issues surrounding monadic scales, inter-attribute comparisons and predictive validity, but due to their relative structure do not allow inter-brand comparisons. That is, MaxDiff and Q-sorting will determine which brand imagery statements have higher or lower scores than other brand imagery statements for a given brand but can’t determine which brand has a higher score than any other brand on any given statement. Some would argue that MaxDiff scaling also does not allow inter-respondent comparisons due to the scale factor. Additionally, as a practical matter, both techniques currently accommodate fewer brands and/or attributes than traditional techniques. 
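To make the MaxDiff task format described above concrete, here is a minimal sketch of how choice sets might be assembled: each task shows a small random subset of items while keeping item exposure roughly balanced. This is a naive frequency-balancing draw for illustration only, not the optimized designs that commercial MaxDiff tools generate; the item labels are placeholders.

```python
import random

def maxdiff_tasks(items, n_tasks=8, items_per_task=4, seed=None):
    """Build simple MaxDiff tasks: random subsets that favor the least-shown items."""
    rng = random.Random(seed)
    shown = {item: 0 for item in items}
    tasks = []
    for _ in range(n_tasks):
        ordered = sorted(items, key=lambda it: (shown[it], rng.random()))  # least-shown first, ties random
        task = ordered[:items_per_task]
        rng.shuffle(task)
        for it in task:
            shown[it] += 1
        tasks.append(task)
    return tasks

statements = [f"Statement {i}" for i in range(1, 13)]  # e.g., 12 brand imagery statements
for task in maxdiff_tasks(statements, seed=1):
    print(task)  # respondent picks "most agree" and "least agree" within each task
```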
Both MaxDiff scaling and Q-sorting take much longer to field than other data collection techniques and are not comparable across studies with different brand and/or attribute sets. Qsorting takes less time to complete than MaxDiff and is somewhat less discriminating. As mentioned earlier, MaxDiff can be made comparable across studies by incorporating the Dual Response version of MaxDiff, which allows the estimation of an absolute reference point. This reference point may come at a price. The inclusion of an anchor point in MaxDiff exercises may reintroduce scale usage bias into the data set. However, for Q-sorting, there is currently no known approach to establish an absolute reference point. For that reason, Q-sorting, for the purposes of this paper, is eliminated as a potential solution to the brand measurement problem. Also, for both MaxDiff and Q-sorting the issue of data collection would need to be addressed. As noted earlier, to remove brand halo from either a MaxDiff-based or Q-sort-based 3 The form of MaxDiff scaling used in brand imagery measurement is referred to as Brand-Anchored MaxDiff (BA MD) 149 brand measurement exercise, it will be necessary to collect brand imagery data on each brand separately, referred to here as Brand-Anchored MaxDiff. If the brands are pooled in the exercise, brand halo would remain. Thus, there is the very real challenge of designing the survey in such a way as to collect an adequate amount of information to accurately assess brand imagery at the disaggregate level without overburdening the respondent. Although one could estimate an aggregate level choice model to estimate brand ratings, that approach is not considered viable here because disaggregate brand ratings data are the current standard. Aggregate estimates would yield neither familiar nor practical data. Specifically, without disaggregate data, common cross tabs of brand ratings would be impossible as would the more advanced predictive model-based analyses. A NEW APPROACH Brand-Anchored MaxDiff, with the exception of being too lengthy to be practical, appears to solve, or at least substantially mitigate, most of the major issues with traditional methods of brand imagery measurement. The approach outlined below attempts to minimize the survey length of Brand-Anchored MaxDiff by increasing the efficiency of two separate components of the research process: Survey instrument design Utility estimation Survey Instrument A new MaxDiff question format, referred to here as Modified Brand-Anchored MaxDiff, accommodates more brands and attributes than the standard design. The format of the Modified Brand-Anchored MaxDiff used in Image MD is illustrated below: 150 To accommodate the Dual Response form of MaxDiff, a Direct Binary Response question is asked prior to the MBA MD task set4: To address the potential scale usage bias of MaxDiff exercises with Direct Binary Response, a negative Direct Binary Response question, e.g., For each brand listed below, please check all the attributes that you feel strongly do not describe the brand, is also included.5 As an additional attempt to mitigate scale usage bias, the negative Direct Binary Response was asked in a slightly different way for half the sample. Half the sample were asked the negative Direct Binary Response question as above. The other half were asked a similar question except that respondents were required to check as many negative items as they had checked positive. 
The first approach is referred to here as unconstrained negative Direct Binary Response and the second is referred to as constrained negative Direct Binary Response. In summary, Image MD consists of an innovative MaxDiff exercise and two direct binary response questions, as shown below: 4 5 This approach to Anchored MaxDiff was demonstrated to be faster to execute than the traditional Dual Response format (Lattery 2010). Johnson and Fuller (2012) note that Direct Binary Response yields a different threshold than traditional Dual Response. By collecting both positive and negative Direct Binary Response data, we will explore ways to mitigate this effect. 151 It is possible, in an online survey, to further increase data collection efficiency with the use of some imaginative programming. We have developed an animated way to display Image MD tasks which can be viewed at www.macroinc.com (Research Techniques tab, MaxDiff Item Scaling). Thus, the final form of the Image MD brand measurement technique can be described as Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response. Utility Estimation Further, an exploration was conducted to reduce the number of tasks seen by any one respondent and still retain sufficiently accurate disaggregate brand measurement data. MaxDiff utilities were estimated using a Latent Class Choice Model (LCCM) and using a Hierarchical Bayes model (HB). By pooling data across similarly behaving respondents (in the LCCM), we hoped to substantially reduce the number of MaxDiff tasks per respondent. This approach may be further enhanced by the careful use of covariates. Another approach that may require fewer MaxDiff tasks per person is to incorporate covariates in the upper model of an HB model or running separate HB models for segments defined by some covariate. To summarize, the proposed approach consists of: 152 Animated Modified Brand-Anchored MaxDiff Exercise With Direct Binary Responses (both positive and negative) Analytic-derived parsimony: o Latent Class Choice Model: Estimate disaggregate MaxDiff utilities Use of covariates to enhance LCCM accuracy o Hierarchical Bayes: HB with covariates in upper model Separate HB runs for covariate-defined segments Adjusted Priors6 RESEARCH OBJECTIVE The objectives, then, of this paper are: To compare this new data collection approach, Animated Modified Brand-Anchored MaxDiff with Direct Binary Response, to a traditional approach using monadic rating scales To compare the positive Direct Binary Response and the combined positive and negative Direct Binary Response To confirm that Animated Modified Brand-Anchored MaxDiff with Direct Binary Response eliminates brand halo To explore ways to include an anchor point without reintroducing scale usage bias To explore utility estimation accuracy of LCCM and HB using a reduced set of MaxDiff tasks To explore the efficacy of various potential covariates in LCCM and HB STUDY DESIGN A two cell design was employed: Traditional brand ratings scales in one cell and the new MaxDiff approach in the other. 
Both cells were identical except in the method by which brand imagery data were collected:

Traditional brand ratings scales
o Three brands, each respondent seeing all three brands
o 12 brand imagery statements

Animated Modified Brand-Anchored MaxDiff with Direct Binary Response
o Three brands, each respondent seeing all three brands
o 12 brand imagery statements
o Positive and negative Direct Binary Response questions

Cell sizes were:
Monadic ratings cell - n = 436
Modified MaxDiff - n = 2,605
o Unconstrained negative DBR - n = 1,324
o Constrained negative DBR - n = 1,281

The larger sample size for the second cell was intended so that attempts to reduce the minimum number of choice tasks via LCCM and/or HB could be fully explored.

6 McCullough (2009) demonstrates that tuning HB model priors can improve hit rates in sparse data sets.

Both cells contained:
Brand imagery measurement (ratings or MaxDiff)
Brand affinity measures
Demographics
Holdout attribute rankings data

RESULTS

Brand Halo

We check for brand halo using confirmatory factor analysis, building a latent factor to capture any brand halo effect. If brand halo exists, the brand halo latent factor will positively influence scores on all items. We observed a clear brand halo effect among the ratings scale data, as expected. The unanchored MaxDiff data showed no evidence of the effect, also as expected. The positive direct binary response reintroduced the brand halo effect to the MaxDiff ratings at least as strongly as in the ratings scale data. This was not expected. However, the effect appears to be eliminated entirely by the inclusion of either the constrained or unconstrained negative direct binary question.

Brand Halo Confirmatory Factor Analytic Structure

Brand Halo Latent    Ratings          No DBR           Positive DBR     Unconstrained Negative DBR   Constrained Negative DBR
                     Std Beta  Prob   Std Beta  Prob   Std Beta  Prob   Std Beta  Prob                Std Beta  Prob
Item 1               0.85      ***    -0.14     ***    0.90      ***    0.44      ***                 0.27      ***
Item 2               0.84      ***    -0.38     ***    0.78      ***    -0.56     ***                 -0.72     ***
Item 3               0.90      ***    -0.20     ***    0.95      ***    0.42      ***                 0.32      ***
Item 4               0.86      ***    0.10      ***    0.90      ***    0.30      ***                 0.16      ***
Item 5               0.77      ***    -0.68     ***    0.88      ***    0.03      0.25                0.01      0.78
Item 6               0.85      ***    -0.82     ***    0.87      ***    -0.21     ***                 -0.24     ***
Item 7               0.83      ***    0.69      ***    0.83      ***    0.42      ***                 0.20      ***
Item 8               0.82      ***    0.24      ***    0.75      ***    0.01      0.87                -0.23     ***
Item 9               0.88      ***    0.58      ***    0.90      ***    0.77      ***                 0.62      ***
Item 10              0.87      ***    0.42      ***    0.94      ***    0.86      ***                 0.90      ***
Item 11              0.77      ***    -0.05     0.02   0.85      ***    0.07      0.02                -0.12     ***
Item 12              0.88      na     0.26      na     0.91      na     0.69      na                  0.53      na

Scale Usage

As with our examination of brand halo, we use confirmatory factor analysis to check for the presence of a scale usage factor. We build in latent factors to capture brand halo per brand, and build another latent factor to capture a scale usage bias independent of brand. If a scale usage bias exists, the scale latent factor should load positively on all items for all brands.

Scale Usage Bias and Brand Halo Confirmatory Factor Analytic Structure

We observe an obvious scale usage effect with the ratings data, where the scale usage latent loads positively on all 36 items. Again, the MaxDiff with only positive direct binary response shows some indication of scale usage bias, even with all three brand halo latents simultaneously accounting for a great deal of collinearity. Traditional MaxDiff and the two versions including positive and negative direct binary responses all show no evidence of a scale usage effect.
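The general-factor logic behind both the brand halo and scale usage checks above can be illustrated with a rough proxy: if a single common factor drives all of a brand's items, the first principal component of the item correlation matrix will load positively (and strongly) on every item. This is only an exploratory stand-in for the confirmatory factor models reported here, and the scores array is hypothetical.

```python
import numpy as np

def first_factor_loadings(scores: np.ndarray) -> np.ndarray:
    """Loadings of the first principal component of the item correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)        # item-by-item correlations
    eigvals, eigvecs = np.linalg.eigh(corr)         # eigenvalues in ascending order
    v = eigvecs[:, -1] * np.sqrt(eigvals[-1])       # loadings on the largest component
    return v if v.sum() >= 0 else -v                # fix the arbitrary sign

# Hypothetical (respondents x 12 items) score matrix for one brand
scores = np.random.default_rng(0).normal(size=(500, 12))
loadings = first_factor_loadings(scores)
print((loadings > 0).all())  # uniformly positive loadings would be consistent with a halo-type factor
```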
Scale Usage Latent                               Ratings   No DBR   Positive DBR   Unconstrained Negative DBR   Constrained Negative DBR
Number of Negative Loadings                            0       14              5                           10                         15
Number of Statistically Significant Loadings          36       31             29                           33                         30

Predictive Validity

In the study design we included a holdout task which asked respondents to rank their top three item choices per brand, giving us a way to test the accuracy of the various ratings/utilities we collected. In the case of all MaxDiff data we compared the top three scoring items to the top three ranked holdout items per person, and computed the hit rate. This approach could not be directly applied to scale ratings data due to the frequency of flat responses (e.g., it is impossible to identify a top three if all items were rated the same). For the ratings data we estimated the hit rate using this approach: if the highest ranked item from the holdout received the highest ratings score, which was shared by n items, we added 1/n to the hit rate. Similarly, the second and third highest ranked holdout items received an adjusted hit point if those items were among the top 3 rated items. We observe that each of the MaxDiff data sets vastly outperformed the ratings scale data, which performed roughly the same as randomly guessing the top three ranked items.

Hit Rates            Random Numbers   Ratings   No DBR   Positive DBR   Unconstrained Negative DBR   Constrained Negative DBR
1 of 1                           8%       14%      27%            28%                          27%                        26%
(1 or 2) of 2                   32%       30%      62%            64%                          62%                        65%
(1, 2 or 3) of 3                61%       51%      86%            87%                          86%                        88%

Inter-item discrimination

Glancing at the resulting item scores, we can see that each of the MaxDiff versions shows greater inter-item discrimination, and among those, both negative direct binary versions bring the lower performing brand closer to the other two brands.

Item score charts by method: Ratings Scales; MaxDiff with Positive DBR; MaxDiff with Positive DBR & Constrained Negative DBR; MaxDiff with Positive DBR & Unconstrained Negative DBR

To confirm, we considered how many statistically significant differences between statements could be observed within each brand per data collection method. The ratings scale data yielded the fewest statistically significant differences across items, while the MaxDiff with positive and unconstrained negative direct binary responses yielded the most. Traditional MaxDiff and MaxDiff with positive and constrained negative direct binary responses also performed very well, while the MaxDiff with only positive direct binary performed much better than ratings scale data, but clearly not as well as the remaining three MaxDiff methods.

Average number of statistically significant differences across 12 items

             Ratings   No DBR   Positive DBR   Unconstrained Negative DBR   Constrained Negative DBR
Brand#1         1.75     4.46            3.9                          4.3                       4.68
New Brand          0     4.28           3.16                         4.25                        4.5
Brand#2            1     4.69           3.78                         4.48                        4.7

Completion Metrics

Using a more sophisticated data collection method comes with a few costs in respondent burden. It took respondents much longer to complete any of the MaxDiff exercises than it took them to complete the simple ratings scales. The dropout rate during the brand imagery section of the survey (measured as the percentage of respondents who began that section but failed to finish it) was also much higher among the MaxDiff versions. On the plus side for the MaxDiff versions, when preparing the data for analysis we were forced to drop far fewer respondents due to flat-lining.

                                         Ratings   All MaxDiff Versions
Brand Image Measurement Time (Minutes)       1.7                      6
Incompletion Rate                             9%                    31%
Post-field drop rate                         32%                     4%
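Before turning to the task-reduction exploration below, the tie-adjusted hit-rate rule used in the Predictive Validity comparison above can be sketched as follows. This is one reading of the rule: fractional 1/n credit is applied at the top rank, and the second and third holdout items score a point if they fall among the top three rated items; the data structures are hypothetical.

```python
def adjusted_hits(ratings: dict, holdout_top3: list) -> float:
    """Tie-adjusted hit count for one respondent and one brand."""
    ordered = sorted(ratings.values(), reverse=True)
    top_score = ordered[0]
    top3_cutoff = ordered[2]                       # score of the 3rd-highest rating
    hits = 0.0
    for rank, item in enumerate(holdout_top3):
        if rank == 0:
            tied = [i for i, s in ratings.items() if s == top_score]
            if item in tied:
                hits += 1.0 / len(tied)            # fractional credit when the top rating is shared
        elif ratings[item] >= top3_cutoff:
            hits += 1.0                            # 2nd/3rd holdout item among the top-3 rated
    return hits

# Flat-ish ratings: three items share the top score, so the first holdout item earns 1/3 of a hit.
print(adjusted_hits({"A": 5, "B": 5, "C": 5, "D": 2}, ["A", "B", "D"]))
```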
Exploration to Reduce Number of Tasks Necessary

We find these results to be generally encouraging, but would like to explore whether anything can be done to reduce the increased respondent burden and dropout rates. Can we reduce the number of tasks each respondent is shown without compromising the predictive validity of the estimated utilities? To find out, we estimated disaggregate utilities using two different estimation methods (Latent Class and Hierarchical Bayes), varying the numbers of tasks, and using certain additional tools to bolster the quality of the data (using covariates, adjusting priors, etc.). We continued only with the two MaxDiff methods with both positive and negative direct binary responses, as those two methods proved best in our analysis. All estimation routines were run for both the unconstrained and constrained versions, allowing us to further compare these two methods.

Our chosen covariates included home ownership (rent vs. own), gender, purchase likelihood for the brand we were researching, and a few others. Including these covariates when estimating utilities in HB should yield better individual results by allowing the software to make more educated estimates based on respondents' like peers. With covariates in place, utilities were estimated using data from 8 (full sample), 4, and 2 MaxDiff tasks, and hit rates were computed for each run. We were surprised to discover that using only 2 tasks yielded only slightly less accuracy than using all 8 tasks. In all cases, hit rates seem to be mostly maintained despite the decreased data.

Using Latent Class, the utilities were estimated again using these same six data subsets. As with HB, reducing the number of tasks used to estimate the utilities had minimal effect on the hit rates. It is worth noting here that when using all 8 MaxDiff tasks latent class noticeably underperforms Hierarchical Bayes, but this disparity decreases as tasks are dropped.

Various Task Hit Rates

                         Unconstrained Negative DBR      Constrained Negative DBR
                         8 Tasks   4 Tasks   2 Tasks     8 Tasks   4 Tasks   2 Tasks
HB   1 of 1                  27%       21%       20%         26%       24%       22%
     (1 or 2) of 2           62%       59%       58%         65%       61%       59%
     (1, 2 or 3) of 3        86%       85%       82%         88%       86%       85%
LC   1 of 1                  19%       20%       19%         21%       21%       22%
     (1 or 2) of 2           54%       57%       56%         61%       59%       56%
     (1, 2 or 3) of 3        81%       82%       83%         84%       84%       82%

In estimating utilities in Hierarchical Bayes, it is possible to adjust the Prior degrees of freedom and the Prior variance. Generally speaking, adjusting these values allows the researcher to change the emphasis placed on the upper-level model. In dealing with sparse data sets, adjusting these values may lead to more robust individual utility estimates. Utilities were estimated with data from 4 tasks, with Prior degrees of freedom from 2 to 1000 (default is 5) and Prior variance from 0.5 to 10 (default is 2). Hit rates were examined at various points on these ranges and compared to the default settings. After considering dozens of non-default configurations we observed essentially zero change in hit rates.

At this point it seemed that there was nothing that could diminish the quality of these utilities, which was a suspicious finding. In searching for a possible explanation, we hypothesized that these data simply have very little heterogeneity. The category of product being researched is not emotionally engaging (light bulbs), and the brands being studied are not very differentiated.
To test this hypothesis, an additional utility estimation was performed using only data from 2 tasks and with a drastically reduced sample size of 105. Hit rates were computed for the low-sample run both at the disaggregate level, that is, using unique individual utilities, and then again with each respondent's utilities set equal to the average of the sample (constant utilities).

Unconstrained Negative DBR

                    Random Choices   HB, 8 Tasks, N=1,324   HB, 2 Tasks, N=105   HB, 2 Tasks, N=105, Constant Utils
1 of 1                          8%                    27%                  22%                                  25%
(1 or 2) of 2                  32%                    62%                  59%                                  61%
(1, 2 or 3) of 3               61%                    86%                  82%                                  82%

These results seem to suggest that there is very little heterogeneity for our models to capture in this particular data, explaining why even low-task utility estimates yield fairly high hit rates. Unfortunately, this means that we cannot say whether we can reduce the survey length of this new approach by reducing the number of tasks needed for estimation.

Summary of Results

                                     Ratings    No DBR   Positive DBR   Unconstrained Negative DBR   Constrained Negative DBR
Provides Absolute Reference Point    No         No       Yes            Yes                          Yes
Brand Halo                           Yes        No       Yes            No                           No
Scale Usage Bias                     Yes        No       Yes            No                           No
Inter-Item Discrimination            Very Low   High     Fairly High    High                         High
Predictive Validity                  Very Low   High     High           High                         High
Complete Time                        Fast       Slow     Slow           Slow                         Slow
Dropout Rate                         Low        High     High           High                         High
Post-Field Drop Rate                 High       Low      Low            Low                          Low

CONCLUSIONS

The form of MaxDiff referred to here as Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response is superior to rating scales for measuring brand imagery:
- Better inter-item discrimination
- Better predictive validity
- Elimination of brand halo
- Elimination of scale usage bias
- Fewer invalid completes

Using positive DBR alone to estimate MaxDiff utilities reintroduces brand halo and possibly scale usage bias. Positive DBR combined with some form of negative DBR to estimate MaxDiff utilities eliminates both brand halo and scale usage bias. Utilities estimated with Positive DBR have slightly weaker inter-item discrimination than utilities estimated with Negative DBR.

The implication of these findings regarding DBR is that perhaps MaxDiff, if anchored, should always incorporate both positive and negative DBR, since positive DBR alone produces highly correlated MaxDiff utilities with less inter-item discrimination. Another, more direct, implication is that Brand-Anchored MaxDiff with both positive and negative DBR is superior to Brand-Anchored MaxDiff with only positive DBR for measuring brand imagery. Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct Binary Response takes longer to administer and has higher incompletion rates, however, and further work needs to be done to make the data collection and utility estimation procedures more efficient.

Paul Richard McCullough

REFERENCES

Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2007), “Making MaxDiff more informative: statistical data fusion by way of latent variable modeling,” 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA

Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2008), “Comparing Apples to Oranges,” Marketing Research Magazine, Spring 2008

Bochenholt, U.
(2004), “Comparative judgements as an alternative to ratings: Identifying the scale origin,” Psychological Methods Chrzan, Keith and Natalia Golovashkina (2006), “An Empirical Test of Six Stated Importance Measures,” International Journal of Marketing Research Chrzan, Keith and Doug Malcom (2007), “An Empirical Test of Alternative Brand Systems,” 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA 163 Chrzan, Keith and Jeremy Griffiths (2005), “An Empirical Test of Brand-Anchored Maximum Difference Scaling,” 2005 Design and Innovations Conference, Berlin Cohen, Steven H. (2003), “Maximum Difference Scaling: Improved Measures of Importance and Preference for Segmentation,” Sawtooth Software Research Paper Series Dillon, William R., Thomas J. Madden, Amna Kirmani and Soumen Mukherjee (2001), "Understanding What’s in a Brand Rating: A Model for Assessing Brand and Attribute Effects and Their Relationship to Brand Equity," JMR Hendrix, Phil, and Drucker, Stuart (2007), “Alternative Approaches to Maxdiff With Large Sets of Disparate Items-Augmented and Tailored Maxdiff, 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA Horne, Jack, Bob Rayner, Reg Baker and Silvo Lenart (2012), “Continued Investigation Into the Role of the ‘Anchor’ in MaxDiff and Related Tradeoff Exercises,” 2012 Sawtooth Software Conference Proceedings, Orlando, FL Johnson, Paul, and Brent Fuller (2012), “Optimizing Pricing of Mobile Apps with Multiple Thresholds in Anchored MaxDiff,” 2012 Sawtooth Software Conference Proceedings, Orlando, FL Lattery, Kevin (2010), “Anchoring Maximum Difference Scaling Against a Threshold-Dual Response and Direct Binary Responses,” 2010 Sawtooth Software Conference Proceedings, Newport Beach, CA Louviere, J.J., Marley, A.A.J., Flynn, T., Pihlens, D. (2009), “Best-Worst Scaling: Theory, Methods and Applications,” CenSoc: forthcoming. Magidson, J., and Vermunt, J.K. (2007a), “Use of a random intercept in latent class regression models to remove response level effects in ratings data. Bulletin of the International Statistical Institute, 56th Session, paper #1604, 1–4. ISI 2007: Lisboa, Portugal Magidson, J., and Vermunt, J.K. (2007b), “Removing the scale factor confound in multinomial logit choice models to obtain better estimates of preference,” 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA McCullough, Paul Richard (2009), “Comparing Hierarchical Bayes and Latent Class Choice: Practical Issues for Sparse Data Sets,” 2009 Sawtooth Software Conference Proceedings, Delray Beach, FL Vermunt and Magidson (2008), LG Syntax User’s Guide: Manual for Latent GOLD Choice 4.5 Syntax Module Wirth, Ralph, and Wolfrath, Annette (2012), “Using MaxDiff for Evaluating Very Large Sets of Items,” 2012 Sawtooth Software Conference Proceedings, Orlando, FL 164 ACBC REVISITED MARCO HOOGERBRUGGE JEROEN HARDON CHRISTOPHER FOTENOS SKIM GROUP ABSTRACT Adaptive Choice-Based Conjoint (ACBC) was developed by Sawtooth Software in 2009 as an alternative to their classic CBC software in order to obtain better respondent data in complex choice situations. Similar to Adaptive Conjoint Analysis (ACA) many years ago, this alternative adapts the design of the choice experiment to the specific preferences of each respondent. Despite its strengths, ACBC has not garnered the popularity that ACA did as CBC has maintained the dominant position in the discrete choice modeling market. 
There are several possibilities concerning the way ACBC is assessed and its various features that may explain why this has happened. In this paper, we compare ACBC to several other methodologies and variations of ACBC itself in order to assess its performance and look into potential ways of improving it. What we show is that ACBC does indeed perform very well for modeling choice behavior in complex markets and it is robust enough to allow for simplifications without harming results. We also present a new discrete choice methodology called Dynamic CBC, which combines features of ACBC and CBC to provide a strong alternative to CBC in situations where running an ACBC study may not be applicable. Though this paper will touch on some of the details of the standard ACBC procedure, for a more in-depth overview and introduction to the methodology please refer to the ACBC Technical Paper published in 2009 from the Sawtooth Software Technical Paper Series. BACKGROUND Our methodological hypotheses as to why ACBC has not yet caught up to the popularity of CBC are related to the way in which the methodology has been critiqued as well as how respondents take the exercise: 1. Our first point relates to the way in which comparison tests between ACBC and CBC have been performed in the past. We believe that ACBC will primarily be successful in markets for which the philosophy behind ACBC is truly applicable. This means that the market should consist of dozens of products so that consumers need to simplify their choices upfront by consciously or subconsciously creating an evoked set of products from which to choose. This evoked set can be different for each consumer: some consumers may restrict themselves to one or more specific brands while other consumers may only shop within specific price tiers. This aligns with the non-compensatory decision making behavior that Sawtooth Software (2009) was aiming to model when designing the ACBC platform. For example, shopping for technology products (laptops, tablets, etc.), subscriptions (mobile, insurance), or cars may be very well suited for modeling via ACBC. 165 If we are studying such a market, the holdout task(s), which are the basis for comparison between methodologies, should also reflect the complexity of the market! Simply put, the holdout tasks should be similar to the scenario that we wish to simulate. Whereas in a typical ACBC or CBC choice exercise we may limit respondents to three to five concepts to ensure that they assess all concepts, holdout tasks for complex markets can be more elaborate as they are used for model assessment rather than deriving preference behavior. 2. For many respondents in past ACBC studies we have seen that they may not have a very ‘rich’ choice tournament because they reject too many concepts in the screening section of the exercise. Some of them do not even get to the choice tournament or see just one choice tournament task. In an attempt to curb this kind of behavior we have tried to encourage respondents to allow more products through the screening section by moderately rephrasing the question text and/or the text of the two answer options (would / would not consider). We have seen that this helps a bit though not enough in expanding the choice tournament. The simulation results are mostly based on the choices in the tournament, so fewer choice tasks could potentially lead to a less accurate prediction. 
If ACBC may be adjusted such that the choice tournament provides ‘richer’ data, the quality of the simulations (and for the holdout predictions) may be improved. 3. Related to our first point on the realism of simulated scenarios, for many respondents we often see that the choice tournament converges to a winning concept that has a price way below what is available in the market (and below what is offered in the holdout task). With the default settings of ACBC (-30% to +30% of the summed component price), we have seen that approximately half of respondents end up with a winning concept that has a price 15– 30% below the market average. Because of this, we may not learn very much about what these respondents would choose in a market with realistic prices. A way to avoid this is to follow Sawtooth Software’s early recommendation to have an asymmetric price range (e.g., from 80% to 140% of market average) and possibly also to use a narrower price range if doing so is more in line with market reality. 4. A completely different kind of hypothesis refers to the design algorithm of ACBC. In ACBC, near-orthogonal designs are generated consisting of concepts that are “nearneighbors” to the respondent’s BYO task configuration while still including the full range of levels across all attributes. The researcher has some input into this process in that they can specify the total number of concepts to generate, the minimum and maximum number of levels to vary from the BYO concept, and the percent deviation from the total summed price as mentioned above (Sawtooth Software 2009). It may be the case that ACBC (at its default settings) is too extreme in its design of concepts at the respondent level and the Hierarchical Bayes estimation procedure may not be able to fully compensate for this in its borrowing process across respondents. For this we propose two potential solutions: a. A modest improvement for keeping the design as D-efficient as possible is to maximize the number of attributes to be varied from the BYO task. b. A solution for this problem may be found outside ACBC as well. An adaptive method that starts with a standard CBC and preserves more of CBC’s D- 166 efficiency throughout the exercise may lead to a better balance between Defficiency and interactivity than is currently available in other methodologies. THE MARKET In order to test the hypotheses mentioned above, we needed to run our experiment in a market that is suitably complex enough to evoke the non-compensatory choice behavior of respondents that is captured by the ACBC exercise. We therefore decided to use the television market in the United States. As the technology that goes into televisions has become more advanced (think smart TVs, 3D, etc.) and the number of brands available has expanded, the choice of a television has grown more elaborate as has the pricing structure of the category. There are so many features that play a role in the price of a television and the importance of these features between respondents can vary greatly. 
Based on research of televisions widely available to consumers in the US, we ran our study with the following attributes and levels: Attribute Brand Screen Size Screen Type Resolution Wi-Fi Capability 3D Capability Number of HDMI Inputs Total Price Levels Sony, Samsung, LG, Panasonic, Vizio 22˝, 32˝, 42˝, 52˝, 57˝, 62˝, 67˝, 72˝ LED, LCD, LED-LCD, Plasma 720p, 1080p Yes/No Yes/No 0–3 connections Summed attribute price ranged from $120 to $3,500 In order to qualify for the study, respondents had to be between the ages of 22 and 65, live independently (i.e., not supported by their parents), and be in the market for a new television sometime within the next 2 years. Respondent recruitment came through uSamp’s online panel. THE METHODS In total we looked into eight different methodologies in order to assess the performance of standard ACBC and explore ways of improving it as per our previously mentioned hypotheses. These methods included variations of ACBC and CBC as well as our own SKIM-developed alternative called Dynamic CBC, which combines key features of both methods. It is important to note that the same attributes and levels were used across all methods and that the designs contained no prohibitions within or between attributes. Additionally, all methods showed respondents a total summed component price for each television, consistent with a standard ACBC exercise, and was taken by a sample of approximately 300 respondents (2,422 in total). The methods are as follows: A. Standard ACBC—When referring to standard ACBC, we mean that respondents completed all portions of the ACBC exercise (BYO, screening section, choice tournament) and that the settings were generally in line with those recommended by Sawtooth Software including having between two and four attributes varied from the BYO exercise in the generation of concepts, six screening tasks with four concepts each, unacceptable questions after the third and fourth screening task, one must-have question 167 B. C. D. E. F. G. 168 after the fifth screening task, and a price range varying between 80–140% of the total summed component price. Additionally, for this particular leg price was excluded as an attribute in the unacceptable questions asked during the screening section. ACBC with price included in the unacceptable questions—This leg had the same settings as the standard ACBC leg however price was included as an attribute in the unacceptable questions of the screening section. The idea behind this is to increase the number of concepts that make it through the screening section in order to create a richer choice tournament for respondents, corresponding with our hypothesis that a longer, more varied choice tournament could potentially result in better data. If respondents confirm that a very high price range is unacceptable to them, the concepts they evaluate no longer contain these higher price points, so we could expect respondents to accept more concepts in the further evaluation. This of course also creates the risk that the respondent is too conservative in screening based on price and leads to a much shorter choice tournament. ACBC without a screening section—Again going back to the hypothesis of the “richness” of the choice tournament for respondents, this leg followed the same settings as our baseline ACBC exercise but skipped the entire screening section of the methodology. By skipping the screening section, this ensured that respondents would see a full set of choice tasks (in this case 12). 
The end result is still respondent-specific as designs are built with the “near-neighbor” concept in mind but they are just prevented from further customizing their consideration set for the choice tournament. We thereby have collected more data from the choice tournament for all respondents, providing more information for utility estimation. Additionally, skipping the screening section may lead to increased respondent engagement by way of shortening the overall length of interview. ACBC with a narrower tested price range—To test our hypothesis on the reality of the winning concept price for respondents, we included one test leg that had the equivalent settings of the standard ACBC leg with the exception of using a narrower tested price deviation from the total summed component price. For this cell a range of 90% to 120% of the total summed component price was used as opposed to 80% to 140% as used by the other legs. Although this is not fully testing our hypothesis as we are not comparing to the default 70% to 130% range, we feel that the 80% to 140% range is already an accepted better alternative to the default range and we can learn more by seeing if an even narrower range improves results. ACBC with 4 attributes varied from the BYO concept—We tested our last hypothesis concerning the design algorithm of ACBC by including a leg which forced there to be four attributes varied (out of seven possible) from the BYO concept across all concepts generated for each respondent whereas all other ACBC legs had between two and four attributes varied as previously mentioned. The logic behind this is to ensure that the nonBYO levels show up more often in the design and therefore push it closer to being more level-balanced and therefore more statistically efficient. Standard CBC—The main purpose of this leg was to serve as a comparison to standard ACBC and the other methods. Price balanced CBC—A CBC exercise that shows concepts of similar price on each screen. This has been operationalized by means of between-concept prohibitions on price. This is similar to the way ACBC concepts within a task are relatively utility balanced since they are all built off variations of the respondent’s BYO concept. Although this is not directly tied into one our hypotheses concerning ACBC, it is helpful in understanding if applying a moderate utility balance within the CBC methodology could help improve results. H. Dynamic CBC—As previously mentioned, this method is designed by SKIM and includes features of both CBC and ACBC. Just like a standard CBC exercise, Dynamic CBC starts out with an orthogonal design space for all respondents as there is no BYO or screening exercise prior to the start of the choice tasks. However like ACBC, the method is adaptive in the way in which it displays the later tasks of the exercise. At several points throughout the course of the exercise, one of the attributes in the pre-drawn orthogonal design has its levels replaced with a level from a concept previously chosen by the respondent. Between these replacements, respondents were asked which particular attribute (in this case television feature) they focused on the most when making their selections in previous tasks. The selection of the attribute to replace was done randomly though the attribute that the respondent stated they focused on the most when making their selection was given a higher probability of being drawn in this process. 
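That weighted draw can be sketched as follows. The weight value, attribute names, and function names are illustrative assumptions for the sketch, not SKIM's actual implementation.

```python
import random

def pick_attribute_to_fix(attributes, stated_focus, focus_weight=3.0, seed=None):
    """Randomly pick the attribute whose level will be replaced, giving the
    attribute the respondent says they focused on a higher probability."""
    rng = random.Random(seed)
    weights = [focus_weight if a == stated_focus else 1.0 for a in attributes]
    return rng.choices(attributes, weights=weights, k=1)[0]

attrs = ["Brand", "Screen Size", "Screen Type", "Resolution", "Wi-Fi", "3D", "HDMI Inputs"]
fixed_attr = pick_attribute_to_fix(attrs, stated_focus="Screen Size", seed=7)
print(fixed_attr)  # this attribute's level would be set to one from a concept the respondent chose earlier
```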
This idea is relatively similar to the “near-neighbors” concept of ACBC in the sense that we are fixing an attribute to a level which we are confident that the respondent prefers (like their BYO preferred level) and thus force them to make trade-offs between other attributes to gain more insights into their preferences in those areas. In this particular study, this adaptive procedure occurred at three points in the exercise. THE COMPARISON At this point we have not yet fully covered how we went about testing our first hypothesis concerning the way in which comparison tests have been performed between ACBC and other methodologies. Aside from applying all methodologies to a complex market, in this case the television market in the United States, the holdout tasks had to be representative of the typical choice that a consumer might have to make in reality. After all, a more relevant simulation from such a study is one that tries to mimic reality as best as possible. In order to better represent the market, respondents completed a set of three holdout tasks consisting of twenty concepts each (not including a none option). Within each holdout task, concepts were relatively similar to one another in terms of their total price so as not to create “easy” decisions for respondents that always went for a particular price tier or feature. We aimed to cover a different tier of the market with each task so as not to alienate any respondents because of their preferred price tier. An example holdout task can be seen below: 169 While quite often holdout tasks will be similar to the tasks of the choice exercise, in the case of complex markets it would be very difficult to have many screening or choice tournament tasks that show as many as twenty concepts on the screen and still expect reliable data from respondents. In terms of the reality of the exercise, it is important to distinguish between the reality of choice behavior and the decision in the end. Whereas the entire ACBC exercise can help us better understand the non-compensatory choice behavior of respondents while only showing three to four concepts at a time, we still must assess its predictive power in the context of an “actual” complex purchase that we aim to model in the simulator. As a result of having so many concepts available in the holdout tasks, the probability of a successful holdout prediction is much lower than what one may be used to seeing in similar research. For example, if we were to use hit rates as a means of assessing each methodology, by random chance we would be correct just 5% of the time. Holdout hit rates also provide no information about the proximity of the prediction, therefore giving no credit for when a method comes close to predicting the correct concept. Take the following respondent simulation for example: Concept Share of Preference 1 0.64% 11 1.48% 2 0.03% 12 0.88% 3 0.65% 13 29.78% 4 0.33% 14 0.95% 5 4.68% 15 27.31% 6 0.18% 16 12.99% 7 4.67% 17 4.67% 8 2.40% 18 0.17% 9 1.75% 19 4.28% 10 0.07% 20 2.11% If, for example, in this particular scenario the respondent chose concept 15, which comes in at a close second to concept 13, it would simply count the methodology as being incorrect despite how close it came to being correct. By definition, the share of preference model is also 170 telling us that there is roughly a 70.22% chance that the respondent would not choose concept 13, so it is not necessarily incorrect in the sense of telling us that the respondent would not choose concept 13 either. 
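For reference, individual shares of preference like those in the example above are typically computed as logit shares of the respondent's estimated utilities for the twenty concepts, optionally tuned by an exponent (a scale factor); as noted later, an exponent of one was used for the results reported here. The utility vector below is hypothetical.

```python
import numpy as np

def share_of_preference(concept_utilities: np.ndarray, exponent: float = 1.0) -> np.ndarray:
    """Logit shares across the concepts in a scenario."""
    u = exponent * concept_utilities
    expu = np.exp(u - u.max())          # subtract the max for numerical stability
    return expu / expu.sum()

rng = np.random.default_rng(42)
shares = share_of_preference(rng.normal(size=20), exponent=1.0)
print(shares.round(4), shares.sum())    # 20 shares summing to 1
```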
Because of this potential for hit rates to understate the accuracy of a method, particularly one with complex holdout tasks, we decided to use an alternative success metric: the means of the share of preference of the concepts that the respondents chose in the three holdout tasks. In the example above, if the respondent chose concept 15, their contribution to the mean share of preference metric for the methodology that they took would be a score of 27.31%. As a benchmark, this could be compared to the predicted share of preference of a random choice, which in this case would be 5%. Please note that all mean share of preferences reported for the remainder of this paper used a default exponent of one in the modeling process. While testing values of the exponent, we found that any exponent value between 0.7 and 1 yielded similar optimal results. In addition, in the tables we have also added the “traditional” hit rate, comparing first choice simulations with the actual answers to holdout tasks. A methodological note on the two measures of performance has been added at the end of this paper. RESPONDENTS ENJOYED BOTH CBC AND ACBC EXERCISES As a first result, it is important to note that there were no significant differences (using an alpha of 0.05) in respondent enjoyment or ease of understanding across the methodologies. At the end of the survey respondents were asked whether they enjoyed taking the survey and whether they found it difficult to fill in the questionnaire. Please note though that these questions were asked in reference to the entire survey, not necessarily just the discrete choice portion of the study. A summary of the results from these questions can be found below: A key takeaway from this is that there is little need to worry about whether respondents will have a difficult time taking a CBC study or enjoy it any less than an ACBC study despite being less interactive. Though this is a purely directional result, it is interesting to note that the 171 methodology that rates least difficult to fill in was ACBC without a screening section, even less so than a standard CBC exercise that does not include a BYO task. MAIN RESULTS Diving into the more important results now, what we see is that all variations of ACBC significantly outperformed all variations of CBC, though there is no significant difference (using an alpha of 0.05) between the different ACBC legs. Interestingly though, Dynamic CBC was able to outperform both standard and price balanced CBC. A summary of the results can be found in the chart below. As noticeable, all the above conclusions hold for both metrics. Some comments on specific legs: - - - 172 ACBC with a price range of 90%-120% behaved as expected in the sense that the concepts in the choice tournament had more realistic price levels. However, this did not result in more accurate predictions. ACBC with price in the unacceptable questions behaved as expected in the sense that no less than 40% of respondents rejected concepts above a certain price point. However, this did not result in a significantly richer choice tournament and consequently did not lead to better predictions. ACBC without a screening section behaved as expected in the sense that there was a much richer choice tournament (by definition, because all near-neighbor concepts were in the tournament). This did not lead to an increase in prediction performance, however. But perhaps more interesting is the fact that the prediction performance did not decline either in comparison with standard ACBC. 
Apparently a rich tournament compensates for dropping the screening section, so it seems like the screening section can be skipped altogether, saving a substantial amount of interview time (except for the most disengaged respondents who formerly largely skipped the tournament). INCREASING THE GRANULARITY OF THE PRICE ATTRIBUTE As we mentioned earlier, each leg showed respondents a summed total price for each concept and this price was allowed some variation according to the ranges that we specified in the setup. This is done by taking the summed component price of the concept and multiplying it by a random number in line with the specified variation range. Because of this process, price is a continuous variable rather than discrete as with the other attributes in the study. Therefore in order to estimate the price utilities, we chose to apply a piecewise utility estimation for the attribute as is generally used in ACBC studies involving price. In the current ACBC software, users are allowed to specify up to 12 cut-points in the tested price range for which price slopes will be estimated between consecutive points. Using this method, the estimated slope is consistent between each of the specified cut-points so it is important to choose these cut-points in such a way that it best reflects the varying consumer price sensitivities across the range (i.e., identifying price thresholds or periods of relative inelasticity). Given the wide range of prices tested in our study (as mentioned earlier, the summed component price ranged from $120 to $3,500) we felt that our price range could be better modeled using more cut-points than the software currently allows. Since the concepts shown to a respondent within ACBC are near-neighbors of their BYO concept, it is quite possible that through this and the screening exercise we only collected data for these respondents on a subset of the total price range. This would therefore make much of the specified cut-points irrelevant to this respondent and lead to a poorer reflection of their price sensitivity across the price range that was most relevant to them. We saw this happening especially with respondents with a BYO choice in the lowest price range. Based on this and the relative count frequencies (percent of times chosen out of times shown) in the price data, we decided to increase the number of cut-points to 28 and run the estimation using CBC HB. As you can see in the data below, ACBC benefits much more from increasing the number of cut-points than CBC (i.e., the predicted SoP in the ACBC legs increases, while the predicted SoP in the CBC legs remains stable; the predicted traditional hit rate in the ACBC legs remains nearly stable while for CBC it decreases). A reasonable explanation for this difference between the ACBC and CBC legs is the fact that in ACBC legs the variation in prices in each individual interview is generally a lot smaller than in a CBC study. So in ACBC studies it does not harm to have that many cut-points and when you would look at the mean SoP metric, it seems actually better to do so. 173 AN ALTERNATIVE MODEL COMPARISON In order to make sure that our results were not purely the result of our mean share of preference metric for comparing the methodologies, we also looked into the mean squared error of the predicted share of preferences across respondents. Using this metric, we could see how much the simulated preference shares deviated from the actual holdout task choices. 
This was calculated as: Where the aggregate concept share from the holdout task is the percent of respondents that selected the concept in the holdout task and the aggregate mean predicted concept SoP is the mean share of preference for the concept across all respondents. As displayed in the table below, using mean squared error as a success metric also confirms our previous results based on the mean share of preference metric: 174 When looking into these results, it was a bit concerning that the square root of the mean squared error was close to the average aggregate share of preference for a concept (5% = 1/20 concepts), particularly for the CBC exercises. Upon further investigation, we noticed that there was one particular concept in one of the holdout tasks that generates a large amount of the squared error in the table above. This concept was the cheapest priced concept in the holdout task that contained products of a higher price tier than the other two holdout tasks. At an aggregate level, this particular concept had an actual holdout share of 43% but the predicted share of preference was significantly lower: the ACBC modules had an average share of preference of 27% for this concept whereas the standard CBC leg (8%) and price-balanced CBC leg (17%) performed much worse. By removing this one concept from the consideration set, it helped to relieve the mean squared error metric quite a bit: Although “only” one concept seems to primarily distort the results, it is nevertheless meaningful to dive further into it. After all, we do not want to get an entirely wrong share from 175 the simulator for just one product, even if it is just one product. In the section below we have looked at it in more depth. Should you be willing to skip this section we can already share with you that we find that brand sensitivity seems to be overestimated by CBC-HB and ACBC-HB and price sensitivity seems to be underestimated. This applies to all CBC and ACBC legs (although ACBC legs slightly less so). The one concept with a completely wrong MSE contribution is exactly the cheapest of the 20 concepts in that task, hence its predicted share is way lower than its stated actual share. CLUSTER ANALYSIS ON PREDICTED SHARES OF PREFERENCE FOR HOLDOUT TASKS A common way to cluster respondents based on conjoint data is by means of Latent Class analysis. In this way we get an understanding of differences between respondents regarding all attribute levels in the study. A somewhat different approach is to cluster respondents based on predicted shares of preference in the simulator. In this way we get an understanding of differences between respondents regarding their preferences for actual products in the market. While Latent Class is based on the entire spectrum of attribute levels, clustering on predicted shares narrows the clustering down to what is relevant in the current market. We took the latter approach and applied CCEA for clustering on the predicted shares of preference for the concepts in the three holdout tasks (as mentioned earlier, the holdout tasks were meant to be representative of a simulated complex market). We combined all eight study legs together to get a robust sample for this type of analysis. This was possible because the structure of the data and the scale of the data (shares of preference) is the same across all eight legs. 
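As an aside before the cluster-level comparison, the aggregate squared-error metric described in the previous section can be sketched as follows: for each of the twenty concepts in a holdout task, compare the percent of respondents who actually chose it with the mean predicted share of preference, and average the squared differences. The arrays below are illustrative only; this mirrors the verbal definition given earlier rather than reproducing the authors' exact computation.

```python
import numpy as np

def holdout_mse(actual_shares: np.ndarray, predicted_mean_sop: np.ndarray) -> float:
    """Mean squared error between aggregate holdout shares and mean predicted shares of preference."""
    return float(np.mean((actual_shares - predicted_mean_sop) ** 2))

actual = np.array([0.43] + [0.03] * 19)   # e.g., one dominant low-priced concept, as in the third holdout task
predicted = np.full(20, 0.05)             # a flat 5% prediction across all 20 concepts
mse = holdout_mse(actual, predicted)
print(mse, np.sqrt(mse))                  # root MSE can be compared with the 5% average share
```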
In retrospect we checked if there were significant differences in cluster membership between the legs (there were some significant differences but these differences were not particularly big). The cluster analysis resulted in the following 2 clusters with each 5 sub-clusters, for a total of 10 clusters: 1. One cluster with respondents preferring a low end TV (up to some $600), about 1/3 of the sample This cluster consists of 5 sub-clusters based on brand—the 5 sub-clusters largely coincide with a strong preference for just one specific brand out of the 5 brands tested, although in some sub-clusters there was a preference for two brands jointly. 2. The other cluster with respondents preferring a mid/high end TV (starting at some $600), about 2/3 of the sample This cluster also consists of 5 sub-clusters based on brand, with the same remark as before for the low end TV cluster. The next step was to redo the MSE comparison by cluster and sub-cluster. Please note that the shares of preference within a cluster by definition will reflect the low end or mid/high end category and a certain brand preference, completely in line with the general description of the cluster. After all, the clustering was based on these shares of preference. The interesting thing is to see how the actual shares (based on actual answers in holdout tasks) behave by sub-cluster. The following is merely an example of one sub-cluster, namely for the “mid/high end Vizio cluster,” yet the same phenomenon applies in all of the other 9 subclusters. The following graph makes the comparison at a brand level (shares of the SKUs of each 176 brand together) and compares predicted shares of preference in the cluster with the actual shares in the cluster, for each of the three choice tasks. Since this is the “mid/high end Vizio cluster,” the predicted shares of preference for Vizio (in the first three rows) are by definition much higher than for the other brands and are also nearly equal for the three holdout tasks. However, the actual shares of Vizio in the holdout tasks (in the last three rows) are very similar to the shares of Samsung and Sony while varying a lot more in the three holdout tasks than in the predicted shares. So the predicted strong brand preference for Vizio is not reflected at all in the actual holdout tasks! The fact that the actual brand shares are quite different across the three tasks has a very likely cause in the particular specifications of the holdout tasks: 1. In the first holdout task, Panasonic had the cheapest product in the whole task, followed by Samsung. Both of these brands have a much higher share than predicted. 2. In the second holdout task, both Vizio and LG had the cheapest product in the whole task. Indeed Vizio has a higher actual share than in the other tasks (but not as overwhelmingly as in the prediction) while LG still has a low actual share (which at least confirms that people in this cluster dislike LG). 3. In the third holdout task, Samsung had the cheapest product in the whole task and has a much higher share than predicted. It is exactly this one concept (the cheapest one of Samsung) that was distorting the whole aggregate MSE analysis that we discussed in the previous section. Less clear in the second holdout task, but very clear in the first and third holdout task, it seems that we have a case here where the importance of brand is being overestimated while the importance of price is being underestimated. 
The actual choices are much more based on price than was predicted and much less based on brand. The earlier MSE analysis in which we found one concept with a big discrepancy between actual and predicted aggregate share perfectly fits into this. Please note though that this is evidence, not proof (we found a black swan but not all swans are black). 177 DISCUSSION AFTER THE CONFERENCE: WHICH METRIC TO USE? A discussion went on after the conference concerning how to evaluate and compare different conjoint methods: is it hit rates (as has been habitual in the conjoint world throughout the years) or is it predicted share of preference of the chosen concept? Hit rates clearly have an advantage of not being subject to scale issues, whereas shares of preference can be manipulated through a lower or higher scale factor in the utilities. Though you may well argue, why would anyone manipulate shares of preference with scale factors to begin with? Predicted shares of preference on the other hand clearly have an advantage of being a closer representation of the underlying logistic probability modeling. Speaking of this model, if the multinomial logistic model specification is correct (i.e., if answers to holdout tasks were to be regarded as random draws from these utility-based individual models), we could not even have a significant difference between the hit rate percentage and the mean share of preference. So reversing this reasoning, since these two measures deviate so much, apparently either the model specification or the estimation procedure contains some element of bias. It may be just the scale factor that is biased, or the problem may be bigger than that. We would definitely recommend some further academic investigation in this issue. CONCLUSIONS In general we were thrilled by some of these findings but less than enthusiastic about others. On a positive note, it was interesting to see that removing the screening section of ACBC does not harm its predictive strength. This can be important for studies in which shortening the length of interview is necessary (i.e., other methods are being used and require the respondent’s engagement or for cost purposes). While for this test we had a customized version of ACBC available, Sawtooth announced during the Conference that skipping the screening section as one of the options will be available to the “general public” in the next SSI Web version (v8.3). On a similar note, using a more narrow price range for ACBC did not hurt either (though nor did it beat standard ACBC) despite the fact that doing so creates more multicollinearity in the design since the total price shown to respondents is more in line with the sum of the concept’s feature prices. Across all methods, it was encouraging to find that all models beat random choice by a factor of 3–4, meaning that it is very well possible to predict respondent choice in complex choice situations such as purchasing one television out of twenty alternatives. One more double-edged learning is that while none of the ACBC alternatives outperformed standard ACBC, it just goes to show that ACBC already performs well as it is! In addition since many of the alternative legs were simplifications of the standard methodology, it shows that the method is robust enough to support simplifications yet still yield similar results. This may very well hint that much of the respondent’s non-compensatory choice behavior can be inferred from their choices between near-neighbors of their BYO concept. 
After all, the BYO concept is for all intents and purposes the “ideal” concept for the respondent. We had also hoped that pricebalanced CBC would help to improve standard CBC since it would force respondents to make trade-offs in areas other than just price, however this did not turn out to be the case. From our own internal perspective, it was quite promising to see that Dynamic CBC outperformed both CBC alternatives though again disappointing that it did not quite match the 178 predictive power of the ACBC legs tested. Despite not performing as well as ACBC, Dynamic CBC could still be a viable methodology to use in cases where something like a BYO exercise or screening section might not make sense in the context of the market being analyzed. In addition, further refinement of the method could possibly lead to results as well if not better than ACBC. Finally we were surprised to see that all of the 8 tested legs were very poor in predicting one of the concepts in one of the holdout tasks (although the ACBC legs did somewhat less poorly than the CBC legs). The underlying phenomenon seems to be that brand sensitivity is overestimated in all legs while price sensitivity is underestimated. This is something definitely to dig into further, and—who knows—may eventually lead to an entirely new conjoint approach beyond ACBC or CBC. NEXT STEPS As mentioned in the results section, there were some discrepancies between the actual holdout responses and the conjoint predictions that we would like to further investigate. Given the promise shown by our initial run of Dynamic CBC, we would like to further test more variants of it and in other markets as well. As a means of further testing our results, we would also like to double-check our findings concerning the increased number of cut-points on any past data that we may have from ACBC studies that included holdout tasks. Marco Hoogerbrugge Jeroen Hardon Christopher Fotenos REFERENCES Sawtooth Software (2009), “ACBC Technical Paper,” Sawtooth Software Technical Paper Series, Sequim, WA 179 RESEARCH SPACE AND REALISTIC PRICING IN SHELF LAYOUT CONJOINT (SLC) PETER KURZ1 TNS INFRATEST STEFAN BINNER BMS MARKETING RESEARCH + STRATEGY LEONHARD KEHL PREMIUM CHOICE RESEARCH & CONSULTING WHY TALK ABOUT SHELVES? For consumers, times long ago changed. Rather than being served by a shop assistant, superand hypermarkets have changed the way we buy products, especially fast-moving consumer goods (FMCGs). In most developed countries there is an overwhelming number of products for consumers to select from: “Traditional Trade” “Modern Trade” As marketers became aware of the importance of packaging design, assortment and positioning of their products on these huge shelves, researchers developed methods to test these new marketing mix elements. One example is a “shelf test” where respondents are interviewed in front of a real shelf about their reaction to the offered products. (In FMCG work, the products are often referred to as “stock keeping units” or “SKUs,” a term that emphasizes that each variation of flavor or package size is treated as a different product.) For a long time, conjoint analysis was not very good at mimicking such shelves in choice tasks: early versions of CBC were limited to a small number of concepts to be shown. Furthermore the philosophical approach for conjoint analysis, let’s call it the traditional conjoint approach, was driven by taking products apart into attributes and levels. 
However, this traditional approach missed some key elements in consumers’ choice situation in front of a modern FMCG shelf, e.g.: How does the packaging design of an SKU communicate the benefits (attribute levels) of a product? 1 Correspondence: Peter Kurz, Head of Research & Development TNS Infratest ([email protected]) Stefan Binner, Managing Director, bms marketing research + strategy ([email protected]) Leonhard Kehl, Managing Director, Premium Choice Research & Consulting ([email protected]) 181 How does an SKU perform in the complex competition with the other SKUs in the shelf? As it became easy for researchers to create shelf-like choice tasks (in 2013, among Sawtooth Software users who use CBC, 11% of their CBC projects employed shelf display) a new conjoint approach developed: “Shelf Layout Conjoint” or “SLC.” HOW IS SHELF LAYOUT CONJOINT DIFFERENT? The main differences between Traditional and Shelf Layout Conjoint are summarized in this chart: TRADITIONAL CONJOINT SHELF LAYOUT CONJOINT - Products or concepts consist usually of defined attribute levels - Communication of “attributes” through non-varying package design (instead of levels) - More rational or textual concept description (compared to packaging picture) - Visibility of all concepts at once - Including impact of assortment - Almost no impact of package design - Usually not too many concepts per task - Including impact of shelf position and number of facings - Information overflow many attributes—few concepts few visible attributes (mainly product and price—picture only)—many concepts Many approaches are used to represent a shelf in a conjoint task. Some are very simple: 182 Some are quite sophisticated: However, even the most sophisticated computerized visualization does not reflect the real situation of a consumer in a supermarket (Kurz 2008). In that paper, comparisons between a simple grid of products from which consumers make their choices and attempts to make the choice exercise more realistic by showing a store shelf in 3D showed no significant differences in the resulting preference share models. THE CHALLENGES OF SHELF LAYOUT CONJOINT Besides differences in the visualization of the shelves, there are different objectives SLCs can address, including: pricing product optimization portfolio optimization positioning layout promotion SLCs also differ in the complexity of their models and experimental designs, ranging from simple main effects models up to complex Discrete Choice Models (DCM’s) with lots of attributes and parameters to be estimated. Researchers often run into very complex models, with one attribute with a large number of levels (the SKU’s) and related to each of these levels one attribute (often, price) with a certain number of levels. Such designs could easily end up with several hundred parameters to be estimated. Furthermore, for complex experimental designs, layouts have to be generated in a special way, in order to retain realistic relationships between SKUs and realistic results. Socalled “alternative-specific designs” are often used in SLC, but that does not necessarily mean that it is always a good idea to estimate price effects as being alternative-specific. In terms of estimating utility values (under the assumption you estimate interaction effects, which lead to alternative-specific price effects), many different coding-schemes can be prepared which are mathematically identical. But, the experimental design behind the shelves is slightly different. 
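A back-of-the-envelope illustration, with made-up counts, of how quickly SKU-specific price effects inflate the number of parameters relative to a single generic price attribute:

```python
# Illustrative parameter count under part-worth coding (sizes are assumptions).
n_skus = 60            # levels of the SKU attribute
n_price_levels = 5     # price points tested per SKU

sku_params = n_skus - 1                                    # SKU main effects
generic_price_params = n_price_levels - 1                  # one shared price attribute
sku_specific_price_params = n_skus * (n_price_levels - 1)  # one price attribute per SKU

print("SKU main effects only:  ", sku_params)                              # 59
print("Plus generic price:     ", sku_params + generic_price_params)       # 63
print("Plus SKU-specific price:", sku_params + sku_specific_price_params)  # 299
```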
Different design strategies affect how much level overlap occurs and therefore how efficient the 183 estimation of interactions can be. Good strategies to reduce this complexity in the estimation stage are crucial. With Shelf Layout Conjoint now easily available to every CBC user, we would like to encourage researchers to use this powerful tool. However, there are at least five critical questions in the design of Shelf Layout Conjoints which need to be addressed: Are the research objectives suitable for Shelf Layout Conjoint? What is the correct target group and SKU space (the “research space”)? Are the planned shelf layout choice tasks meaningful for respondents, and will they provide the desired information from their choices? Can we assure realistic inputs and results with regard to pricing? How can we build simulation models that provide reliable and meaningful results? As this list suggests, there are many topics, problems and possible solutions with Shelf Layout Conjoint. However, this paper focuses on only three of these very important issues. We take a practitioner’s, rather than an academic’s point of view. The three key areas we will address are: 1. Which are suitable research objectives? 2. How to define the research space? 3. How to handle pricing in a realistic way? SUITABLE RESEARCH OBJECTIVES FOR SHELF LAYOUT CONJOINT Evaluating suitable objectives requires that researchers be aware of all the limitations and obstacles Shelf Layout Conjoint has. So, we begin by introducing three of those key limitations. 1. Visualization of the test shelf. Real shelves always look different than test shelves. Furthermore there is naturally a difference between a 21˝ Screen and a 10 meter shelf in a real supermarket. The SKUs are shown much smaller than in reality. One cannot touch and feel them. 3D models and other approaches might help, but the basic issue still remains. 184 2. Realistic choices for consumers. Shelf Layout Conjoint creates an artificial distribution and awareness: All products are on the shelf; respondents are usually asked to look at or consider all of them. In addition, we usually simplify the market with our test shelf. In reality every distribution chain has different shelf offerings, which might further vary with the size of the individual store. In the real world, consumers often leave store A and go to store B if they do not like the offering (shop hopping). Sometimes products are out of stock, forcing consumers to buy something different. 3. Market predictions from Shelf Layout Conjoint. Shelf Layout Conjoint provides results from a single purchase simulation. We gain no insights about repurchase (did they like the product at all?) or future purchase frequency. In reality, promotions play a big role, not only in the shelf display but in other ways, for example, with second facings. It is very challenging to measure volumetric promotion effects such as “stocking up” purchases, but those play a big role in some product categories (Eagle 2010; Pandey, Wagner 2012). The complexity of “razor and blade” products, where manufacturers make their profit on the refill or consumable rather than on the basic product or tool, are another example of difficult obstacles researchers can be faced with. SUITABLE OBJECTIVES Despite these limitations and obstacles Shelf Layout Conjoint can provide powerful knowledge. It is just a matter of using it to address the right objectives; if you use it appropriately, it works very well! 
Usually suitable objectives for Shelf Layout Conjoint fall in the areas of either optimization of assortment or pricing. The optimization of assortment most often refers to such issues as: Line extensions with additional SKUs 185 What is the impact (share of choice) of the new product? Where does this share of choice come from (customer migration and/or cannibalization)? Which possible line extension has the highest consumer preference or leads to the best overall result for the total brand assortment or product line? Re-launch or substitution of existing SKUs What is the impact (share of choice) for the re-launch? Which possible re-launch alternative has the highest consumer preference? Which SKU should be substituted for? How does the result of the re-launch compare to a line extension? Branding What is the effect of re-branding a line of products? What is the effect of the market entry of new brands or competitors? The optimization of pricing most often involves questions like: Price positioning What is the impact of different prices on share of choice and profit? How should the different SKUs within an assortment be priced? How will the market react to a competitor’s price changes? Promotions What is the impact (sensitivity) to promotions? Which SKUs have the highest promotion effect? How much price reduction is necessary to create a promotion effect? Indirect pricing What is the impact of different contents (i.e., package sizes) on share of choice and profit? How should the contents of different SKUs within an assortment be defined? How will the market react to competitors’ changes in contents? On the other hand there are research objectives which are problematic, or at least challenging, for Shelf Layout Conjoint. Some examples include: 186 Market size forecasts for a sales period Volumetric promotion effects Multi category purchase/TURF-like goals Positioning of products on the shelf Development of package design Evaluation of new product concepts New product development Not all of the above research objectives are impossible, but they at least require very tailored or cautious approaches. DEFINITION OF THE CORRECT MARKET SCOPE By the terminology “market scope” we mean the research space of the Shelf Layout Conjoint. Market scope can be defined by three questions, which are somewhat related to each other: What SKUs do we show on the Shelf? What consumers do we interview? What do we actually ask them to do? => SKU Space => Target Group => Context SKU Space Possible solutions to this basic problem depend heavily on the specific market and product category. Two main types of solutions are: 1. Focusing the SKU space on or by market segments Such focus could be achieved by narrowing the SKU space to just a part of the market such as - distribution channel (no shop has all SKUs) - product subcategories (e.g., product types such as solid vs. liquid) - market segments (e.g., premium or value for money) This will provide more meaningful results for the targeted market segment. However it may miss migration effects from (or to) other segments. Furthermore such segments might be quite artificial from the consumer’s point of view. Alternatively one could focus on the most relevant products only (80:20 rule). 2. 
Strategies to cope with too many SKUs 187 When there are more SKUs than can be shown on the screen or understood by the respondents, strategies might include - Prior SKU selection (consideration sets) - Multiple models for subcategories - Partial shelves Further considerations for the SKU space: - Private labels (which SKUs could represent this segment?) - Out of stock effects (whether and how to include them?) Target Group The definition of the target group must align with the SKU space. If there is a market segment focus, then obviously the target group should only include customers in that segment. Conversely, if there are strategies for huge numbers of SKUs all relevant customers should be included. There are still other questions about the target group which should also be addressed, including: - Current buyers only or also non-buyers? - Quotas on recently used brands or SKUs? - Quotas on distribution channel? - Quotas on the purchase occasion? Context Once the SKU space and the target group are defined the final element of “market scope” is to create realistic, meaningful choice tasks: 1. Setting the scene: - Introduction of new SKUs - Advertising simulation 2. Visualization of the shelf - Shelf Layout (brand blocks, multiple facings) - Line pricing/promotions - Possibility to enlarge or “examine” products 3. The conjoint/choice question - What exactly is the task? - Setting a purchase scenario or not? - Single choice or volumetric measurement? PRICING Pricing is one of the most, if not the most, important topic in Shelf Layout Conjoint. In nearly all SLCs some kind of pricing issue is included as an objective. But “pricing” does not mean just one common approach. Research questions in regard to pricing are very different between different studies. They start with easy questions about the “right price” of a single SKU. They often include the pricing of whole product portfolios, including different pack sizes, flavors and 188 variants and may extend to complicated objectives like determining the best promotion price and the impact of different price tags. Before designing a Shelf Layout Conjoint researchers must therefore have a clear answer to the question: “How can we obtain realistic input on pricing?” Realistic pricing does not simply mean that one needs to design around the correct regular sales price. It also requires a clear understanding of whether the following topics play a role in the research context. Topic 1: Market Relevant Pricing The main issue of this topic is to investigate the context in which the pricing scenario takes place. Usually such an investigation starts with the determination of actual sales prices. At first glance, this seems very easy and not worth a lot of time. However, most products are not priced with a single regular sales price. For example, there are different prices in different sales channels or store brands. Most products have many different actual sales prices. Therefore one must start with a closer look at scanner data or list prices of the products in the SKU space. As a next step, one has to get a clear understanding of the environment in which the product is sold. Are there different channels like hypermarkets, supermarkets, traders, etc. that have to be taken into account? In the real world, prices are often too different across channels to be used in only one Shelf Layout Conjoint design. So we often end up with different conjoint models for the different channels. Furthermore, the different store brands may play a role. 
Store brand A might have a different pricing strategy because it competes with a different set of SKUs than store brand B. How relevant are the different private labels or white-label/generic products in the researched market? In consequence one often ends up with more than one Shelf Layout Conjoint model (perhaps even dozens of models) for one “simple” pricing context. In such a situation, researchers have to decide whether to simulate each model independently or to build up a more complex simulator. This will allow pricing simulations on an overall market level, tying together the large number of choice models, to construct a realistic “playground” for market simulations. Topic 2: Initial Price Position of New SKUs With the simulated launch of new products one has to make prior assumptions about their pricing before the survey is fielded. Thus, one of the important tasks for the researcher and her client is to define reasonable price points for the new products in the model. The price range must be as wide as necessary, but as narrow as possible. Topic 3: Definition of Price Range Widths and Steps Shelf Layout Conjoint should cover all possible pricing scenarios that would be interesting for the client. However, respondents should not be confronted with unrealistically high or low prices. Such extremes might artificially influence the choices of the respondent and might have an influence on the measured price elasticity. Unrealistically high price elasticity is usually caused by too wide a price range, with extremely cheap or extremely expensive prices. One should be aware that the price range over which an SKU is studied has a direct impact on its elasticity results! This is not only true for new products, where respondents have no real price 189 knowledge, but also for existing products. Furthermore unrealistically low or high price points can result in less attention from respondents and more fatigue in the answers of respondents, than realistic price changes would have caused. Topic 4: Assortment Pricing (Line Pricing) Many clients have not just a single product in the market, but a complete line of competing products on the same shelf. In such cases it is often important to price the products in relation to each other. A specific issue in this regard is line pricing: several products of one supplier share the same price, but differ in their contents (package sizes) or other characteristics. Many researchers measure the utility of prices independently for each SKU and create line pricing only in the simulation stage. However, in this situation, it is essential to use line-priced choice tasks in the interview: respondents’ preference structure can be very different when seeing the same prices for all products of one manufacturer rather than seeing different prices, which often results in choosing the least expensive product. This leads to overestimation of preference shares for cheaper products. A similar effect can be observed if the relative price separations of products are not respected in the choice tasks. For example: if one always sells orange juice for 50 cents more than water, this relative price distance is known or learned by consumers and taken into account when they state their preference in the choice tasks. Special pricing designs such as line pricing can be constructed by exporting the standard design created with Sawtooth Software’s CBC into a CSV format and reworking it in Excel. 
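The same rework can equally be scripted rather than done in Excel. The sketch below assumes a long-format export with one row per shelf concept; the column names and the SKU-to-manufacturer lookup are illustrative, not the actual export layout.

```python
import pandas as pd

# Assumed long-format export with illustrative column names
# ("Version", "Task", "Concept", "SKU", "PriceLevel").
design = pd.read_csv("cbc_design_export.csv")

# Illustrative lookup from SKU to manufacturer.
manufacturer = {"SKU_A1": "BrandA", "SKU_A2": "BrandA", "SKU_B1": "BrandB"}
design["Manufacturer"] = design["SKU"].map(manufacturer)

# Line pricing: within each choice task, force every SKU of a manufacturer to
# carry the price level drawn for the first of that manufacturer's SKUs.
design["PriceLevel"] = (
    design.groupby(["Version", "Task", "Manufacturer"])["PriceLevel"].transform("first")
)

design.drop(columns="Manufacturer").to_csv("cbc_design_linepriced.csv", index=False)
```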
However, one must test the manipulated designs afterwards in order to ensure the prior design criteria are still met. This is done by re-importing the modified design into Sawtooth Software’s CBC and running efficiency tests. Topic 5: Indirect Pricing In markets where most brands offer line pricing the real individual price positioning of SKUs is often achieved through variation in their package content sizes. This variation can be varied and modeled in the same way and with the same limitations as monetary prices. However, one must ensure that the content information is sufficiently visible to the consumers (e.g., written on price tags or on the product images). Topic 6: Price Tags Traditionally, prices in conjoint are treated like an attribute level and are simply displayed beneath each concept. Therefore in many Shelf Layout Conjoint projects the price tag is simply the representation of the actual market price and the selected range around it. However, in reality consumers see the product name, content size, number of applications, price per application or 190 standard unit in addition to the purchase price. (In the European Community, such information is mandatory by law; in many other places, it is at least customary if not required.) In choice tasks, many respondents therefore search not only for the purchase price, but also for additional information about the SKUs in their relevant choice set. Oversimplification of price tags in Shelf Layout Conjoint does not sufficiently reflect the real decision process. Therefore, it is essential to include the usual additional information to ensure realistic choice tasks for respondents. Topic 7: Promotions The subject of promotions in Shelf Layout Conjoint is often discussed but controversial. In our opinion, only some effects of promotions can be measured and modeled in Shelf Layout Conjoint. SLC provides a one-point-in-time measurement of consumer preference. Thus, promotion effects which require information over a time period of consumer choices cannot be measured with SLC. It is essential to keep in mind that we can neither answer the question if a promotion campaign results in higher sales volume for the client nor make assumptions about market expansion—we simply do not know anything about the purchase behavior (what and how much) in the future period. However, SLC can simulate customers’ reaction to different promotion activities. This includes the simulation of the necessary price discount in order to achieve a promotion effect, comparison of the effectiveness of different promotion types (e.g., buy two, get one free) as well as simulation of competitive reactions, but only at a single point in time. In order to analyze such promotion effects with high accuracy, we recommend applying different attributes and levels for the promotional offers from those for the usual market prices. SLC including promotion effects therefore often have two sets of price parameters, one for the regular market price and one for the promotional price. Topic 8: Price Elasticity Price Elasticity is a coefficient which tells us how sales volume changes when prices are changed. However, one cannot predict sales figures from SLC. What we get is “share of preference” or “share of choice” and we know whether more or fewer people are probably purchasing when prices change. 
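A minimal sketch of what such a share-based coefficient might look like in code; the utilities and the assumed utility effect of the price move are illustrative, and the result is explicitly not a sales-based elasticity.

```python
import numpy as np

def shares(utilities):
    """Logit shares of preference across the SKUs on the simulated shelf."""
    expu = np.exp(utilities - utilities.max())
    return expu / expu.sum()

def price_to_share_coefficient(base_util, price_util_change, sku, price_change_pct):
    """Arc 'price to share of preference coefficient' for one SKU: percent change
    in its share divided by the percent change in its price. Not a sales-based
    price elasticity, since no purchase volumes enter the calculation."""
    s0 = shares(base_util)[sku]
    new_util = base_util.copy()
    new_util[sku] += price_util_change        # assumed utility effect of the move
    s1 = shares(new_util)[sku]
    return ((s1 - s0) / s0) / price_change_pct

# Toy shelf with 5 SKUs (aggregate utilities, purely illustrative).
u = np.array([0.2, 0.5, -0.1, 0.3, 0.0])
# A 10% price increase on SKU 1, assumed to lower its utility by 0.4.
print(round(price_to_share_coefficient(u, -0.4, sku=1, price_change_pct=0.10), 2))
```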
In categories with single-unit purchase cycles, this is not much of a problem, but in the case of Fast Moving Consumer Goods (FMCG) with large shelves where consumers often buy more than one unit—especially under promotion—it is very critical to be precise when talking about price elasticity. We recommend speaking carefully of a “price to share of preference coefficient” unless sales figures are used in addition to derive true price elasticity. 191 The number of SKUs included in the research has a strong impact on the “price to share of preference coefficient.” The fewer SKUs one includes in the model, the higher the ratio; many researchers experience high “ratios” that are only due to the research design. But they are certainly wrong if the client wants to know the true “coefficient of elasticity” based on sales figures. Topic 9: Complexity of the Model SLC models are normally far more complex than the usual CBC/DCM models. The basic structure of SLC is usually one many-leveled SKU attribute and for each (or most) of its levels, one price attribute. Sometimes there are additional attributes for each SKU such as promotion or content. As a consequence there are often too many parameters to be estimated in HB. Statistically, we have “over-parameterization” of the model. However there are approaches to reduce the number of estimated parameters, e.g.: Do we need part-worth estimates for each price point? Could we use a linear price function? Do we really need price variation for all SKUs? Could we use fixed price points for some competitors’ SKUs? Could we model different price effects by price tiers (such as low, mid, high) instead of one price attribute per SKU? Depending on the quantity of information one can obtain from a single respondent, it may be better to use aggregate models than HB models. The question is, how many tasks could one really ask of a single respondent before reaching her individual choice task threshold, and how many concepts could be displayed on one screen (Kurz, Binner 2012)? If it’s not possible to show respondents a large enough number of choice tasks to get good individual information, relative to the large number of parameters in an SLC model, HB utility estimates will fail to capture much heterogeneity anyway. TOPICS BEYOND THIS PAPER How can researchers further ensure that Shelf Layout Conjoint provides reliable and meaningful results? Here are some additional topics: Sample size and number of tasks Block designs Static SKUs Maximum number of SKUs on shelf Choice task thresholds Bridged models Usage of different (more informative) priors in HB to obtain better estimates EIGHT KEY TAKE-AWAYS FOR SLC 1. Be aware of its limitations when considering Shelf Layout Conjoint as a methodology for your customers’ objectives. One cannot address every research question. 192 2. Try hard to ensure that your pricing accurately reflects the market reality. If one model is not possible, use multi-model simulations or single market segments. 3. Be aware that the price range definition for a SKU has a direct impact on its elasticity results and define realistic price ranges with care. 4. Adapt your design to the market reality (e.g., line pricing), starting with the choice tasks and not only in your simulations. 5. Do not oversimplify price tags in Shelf Layout Conjoint; be sure to sufficiently reflect the real decision environment. 6. SLC provides just a one point measurement of consumer preference. 
Promotion effects that require information about a time period of consumer choices cannot be measured. 7. Price elasticity derived from SLC is better called “price to share of preference coefficients.” 8. SLC often suffers from “over-parameterization” within the model. One should evaluate different approaches to reduce the number of estimated parameters. Peter Kurz Stefan Binner Leonhard Kehl REFERENCES Eagle, Tom (2010): Modeling Demand Using Simple Methods: Joint Discrete/Continuous Modeling; 2010 Sawtooth Software Conference Proceedings. Kehl, Leonhard; Foscht, Thomas; Schloffer, Judith (2010): Conjoint Design and the impact on Price-Elasticity and Validity; Sawtooth Europe Conference Cologne. Kurz, Peter; Binner, Stefan (2012): “The Individual Choice Task Threshold” Need for Variable Number of Choice Tasks; 2012 Sawtooth Software Conference Proceedings. Kurz, Peter (2008): A comparison between Discrete Choice Models based on virtual shelves and flat shelf layouts; SKIM Working Towards Symbioses Conference Barcelona. Orme, Bryan (2003): Special Features of CBC Software for Packaged Goods and Beverage Research; Sawtooth Software, via Website. 193 Pandey, Rohit; Wagner, John; Knappenberger, Robyn; (2012): Building Expandable Consumption into a Share-Only MNL Model; 2012 Sawtooth Software Conference Proceedings. 194 ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS DAN YARDLEY MARITZ RESEARCH EXECUTIVE SUMMARY Some respondents ignore certain attributes in choice experiments to help them choose between competing alternatives. By asking the respondents which attributes they ignored and accounting for this attribute non-attendance we hope to improve our preference models. We test different ways of asking stated non-attendance and the impact of non-attendance on partial profile designs. We also explore an empirical method of identifying and accounting for attribute non-attendance. We found that accounting for stated attribute non-attendance does not improve our models. Identifying and accounting for non-attendance empirically can result in better predictions on holdouts. BACKGROUND Recent research literature has included discussions of “attribute non-attendance.” In the case of stated non-attendance (SNA) we ask respondents after they answer their choice questions, which attributes, if any, they ignored when they made their choices. This additional information can then be used to zero-out the effect of the ignored attributes. Taking SNA into account theoretically improves model fit. We seek to summarize this literature, to replicate findings using data from two recent choice experiments, and to test whether taking SNA into account improves predictions of in-sample and out-of-sample holdout choices. We also explore different methods of incorporating SNA into our preference model and different ways of asking SNA questions. In addition to asking different stated non-attendance questions, we will also compare SNA to stated attribute level ratings (two different scales tested) and self-explicated importance allocations. Though it’s controversial, some researchers have used latent class analysis to identify nonattendance analytically (Hensher and Greene, 2010). We will attempt to identify non-attendance by using HB analysis and other methods. We will determine if accounting for “derived nonattendance” improves aggregate model fit and holdout predictions. We will compare “derived non-attendance” to the different methods of asking SNA and stated importance. 
It’s likely that non-attendance is more of an issue for full profile than for partial profile experiments, another hypothesis our research design will allow us to test. By varying the attributes shown, we would expect respondents to pay closer attention to which attributes are presented, and thus ignore fewer attributes. STUDY 1 We first look at the data from a tablet computer study conducted in May 2012. Respondents were tablet computer owners or intenders. 502 respondents saw 18 full profile choice tasks with 3 alternatives and 8 attributes. The attributes and levels are as follows: 195 Attribute Level 1 Operating System Apple Memory 8 GB Included Cloud Storage None Price $199 Camera Picture Quality 0.3 Megapixels Warranty 3 Months Screen Size 5" High definition display Screen Resolution (200 pixels per inch) Level 2 Android 16 GB 5 GB $499 2 Megapixels 1 Year 7" Extra-high definition display (300 pixels per inch) Level 3 Windows 64 GB 50 GB $799 5 Megapixels 3 Years 10" Another group of 501 respondents saw 16 partial profile tasks based upon the same attributes and levels as above. They saw 4 attributes at a time. 300 more respondents completed a differently formatted set of partial profile tasks and were designated for use as an out-of-sample holdout. All respondents saw the same set of 6 full profile holdout tasks and a stated nonattendance question. The stated non-attendance question and responses were: Please indicate which of the attributes, if any, you ignored when you made your choices in the preceding questions: Memory Operating System Screen Resolution Price Screen Size Warranty Camera Picture Quality Included Cloud Storage I did not ignore any of these Average # Ignored Full Profile 17.5% 20.9% 22.7% 26.9% 27.3% 27.9% 30.1% 36.7% 26.1% Partial Profile 14.2% 18.6% 14.0% 16.4% 17.2% 24.2% 21.8% 28.9% 36.3% 2.10 1.55 Full Profile 8 7 6 5 4 3 2 1 Partial Profile 7 4 8 6 5 2 3 1 We can see that the respondents who saw partial profile choice tasks answered the stated nonattendance question differently from those who saw full profile. The attribute rankings are different for partial and full profile, and 6 of the 9 attribute frequencies are significantly different (bold). Significantly more partial profile respondents stated that they did not ignore any of the attributes. From these differences we conclude that partial profile respondents pay closer attention to which attributes are showing. This is due to the fact that the attributes shown vary from task to task. We now compare aggregate (pooled) multinomial logistic models created using data augmented with the stated non-attendance (SNA) question and models with no SNA. The way we account for the SNA question is: if the respondent said they ignored the attribute, we zero out 196 the attribute in the design, effectively removing the attribute for that respondent. Accounting for SNA in this manner yields mixed results. Holdout Tasks Full Profile 56.7% 56.9% No SANA With SANA Likelihood Ratio Full Profile No SANA 1768 With SANA 2080 Partial 53.1% 51.6% Partial 2259 2197 Out of Sample Full Profile 47.7% 52.0% No SANA With SANA Partial 55.4% 55.4% For the full profile respondents, we see slight improvement in Holdout Tasks hit rates from 56.7% to 56.9% (applying the aggregate logit parameters to predict individual holdout choices). 
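To make the "zero out the attribute in the design" adjustment and the hit-rate calculation concrete, here is a small sketch of our own, with assumed shapes and toy numbers rather than the study data.

```python
import numpy as np

def zero_out_ignored(X, ignored, attr_columns):
    """Zero the design-matrix columns of attributes a respondent said they ignored.
    X: (rows, n_columns) coded design for that respondent;
    attr_columns maps attribute name -> list of its column indices."""
    X = X.copy()
    for attr in ignored:
        X[:, attr_columns[attr]] = 0.0
    return X

def hit_rate(beta, X_holdout, chosen):
    """Share of holdout tasks where the highest-utility alternative is the chosen one.
    X_holdout: (n_tasks, n_alternatives, n_columns); chosen: chosen index per task."""
    utilities = X_holdout @ beta
    return float(np.mean(utilities.argmax(axis=1) == chosen))

# Toy setup: two attributes coded into two columns each.
attr_columns = {"Price": [0, 1], "Memory": [2, 3]}
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))
X_adj = zero_out_ignored(X, ignored=["Memory"], attr_columns=attr_columns)
print(abs(X_adj[:, 2:4]).sum())            # 0.0 -> Memory no longer adds utility

beta = np.array([0.5, -0.5, 0.3, -0.3])
X_holdout = rng.normal(size=(3, 4, 4))     # 3 holdout tasks, 4 alternatives each
chosen = np.array([0, 2, 1])
print(hit_rate(beta, X_holdout, chosen))
```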
Out-of-sample hit rates (applying the aggregate logit parameters from the training set to each out-of-sample respondent's choices) and Likelihood Ratio (2 times the difference in the model LL compared to the null LL) also show improvement. For partial profile respondents, accounting for the SNA did not lead to better results. It should be noted that partial profile performed better on out-of-sample hit rates. This, in part, is due to the fact that the out-of-sample tasks are partial profile. Similarly, the full profile respondents performed better on holdout task hit rates due to the holdout tasks being full profile tasks. Looking at the resulting parameter estimates we see little to no differences between models that account for SNA (dashed lines) and those that don't. We do, however, see slight differences between partial profile and full profile.

[Chart: aggregate logit parameter estimates by attribute level for the FP, FP SNA, PP and PP SNA models]

Shifting from aggregate models to Hierarchical Bayes, again we see that accounting for SNA does not lead to improved models. In addition to accounting for SNA by zeroing out the ignored attributes in the design (Pre), we also look at zeroing out the resulting HB parameter estimates (Post). Zeroing out HB parameters (Post) has a negative impact on holdout tasks. Not accounting for SNA resulted in better holdout task hit rates and root likelihood (RLH).

Holdout Task Hit Rates   Full Profile   Partial
No SNA                   72.8%          63.2%
Pre SNA                  71.1%          61.1%
Post SNA                 66.6%          59.4%

RLH                      Full Profile   Partial
No SNA                   0.701          0.622
With SNA                 0.633          0.565

Lots of information is obtained by asking respondents multiple choice tasks. With all this information at our fingertips, is it really necessary to ask additional questions about attribute attendance? Perhaps we can derive attribute attendance empirically, and improve our models as well. One simple method of calculating attribute attendance that we tested is to compare each attribute's utility range from the Hierarchical Bayes models to the attribute with the highest utility range. Below are the utility ranges for the first three full profile respondents from our data.

Utility Ranges
ID   Operating System   Memory   Cloud Storage   Price   Screen Resolution   Camera Megapixels   Warranty   Screen Size
6    2.73               1.89     1.52            1.82    1.01                2.65                0.84       1.13
8    0.14               0.64     0.04            9.87    0.02                0.68                0.38       0.51
12   3.07               0.64     0.16            1.36    0.12                1.60                0.91       0.94

With a utility range for Price of 9.87 and all other attribute ranges of less than 1, we can safely say that the respondent with ID 8 did not ignore Price. The question becomes, at what point are attributes being ignored? We analyze the data at various cut points. For each cut point we find the attribute for each individual with the largest range and then assume everything below the cut point is ignored by the individual. For example, if the utility range of Price is the largest at 9.87, a 10% cut point drops all attributes with a utility range of .98 and smaller. Below we have identified for our three respondents the attributes they empirically ignored at the 10% cut point (marked with an asterisk in the table below). At this cut point we would say the respondent ID 6 ignored none of the attributes, ID 8 ignored 7, and ID 12 ignored 2 of the 8 attributes.
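The rule just described takes only a few lines to implement. The sketch below is ours, not the authors' code; it reuses the utility ranges of the three respondents from the table above and reproduces the ignored-attribute counts of 0, 7 and 2 at the 10% cut point.

```python
import numpy as np

# Utility ranges for the three respondents shown above (IDs 6, 8, 12), in the
# attribute order: OS, Memory, Cloud Storage, Price, Screen Resolution,
# Camera Megapixels, Warranty, Screen Size.
ranges = np.array([
    [2.73, 1.89, 1.52, 1.82, 1.01, 2.65, 0.84, 1.13],   # ID 6
    [0.14, 0.64, 0.04, 9.87, 0.02, 0.68, 0.38, 0.51],   # ID 8
    [3.07, 0.64, 0.16, 1.36, 0.12, 1.60, 0.91, 0.94],   # ID 12
])

def ignored_flags(ranges, cut=0.10):
    """True where an attribute's range falls below cut * the respondent's
    largest range, i.e. the attribute is treated as empirically ignored."""
    threshold = cut * ranges.max(axis=1, keepdims=True)
    return ranges < threshold

flags = ignored_flags(ranges, cut=0.10)
print(flags.sum(axis=1))   # -> [0 7 2], matching the counts reported above
```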
Utility Ranges (10% Cut Point; * = treated as ignored)
ID   Operating System   Memory   Cloud Storage   Price   Screen Resolution   Camera Megapixels   Warranty   Screen Size
6    2.73               1.89     1.52            1.82    1.01                2.65                0.84       1.13
8    0.14*              0.64*    0.04*           9.87    0.02*               0.68*               0.38*      0.51*
12   3.07               0.64     0.16*           1.36    0.12*               1.60                0.91       0.94

We now analyze different cut points to see if we can improve the fit and predictive ability of our models, and to find an optimal cut point.

[Chart: average hit rate by cut point below the maximum range (none through 50%) for Internal - FP, Internal - PP, Hold Out - FP and Hold Out - PP]

We see that as the cut point increases so does the hit rate on the holdout tasks. An optimal cut point is not obvious. We will use a 10% cut point as a conservative look and 45% as an aggressive cut point. Looking further at the empirical 10% and 45% cut points in the table below, we see that for full profile we get an improved mean absolute error (MAE). For partial profile the MAE stays about the same. Empirically accounting for attribute non-attendance improved our models while accounting for stated non-attendance did not.

MAE            No SNA   SNA     Emp 10   Emp 45
Full Profile   0.570    0.686   0.568    0.534
Partial        1.082    1.109   1.082    1.089

From this first study we see that respondents pay more attention to partial profile type designs and conclude that accounting for stated non-attendance does not improve our models. We wanted to further explore these findings and test our empirical methods, so we conducted a second study.

STUDY 2

The second study we conducted was a web-based survey fielded in March 2013. The topic of this study was "significant others" and included respondents that were interested in finding a spouse, partner, or significant other. We asked 2,000 respondents 12 choice tasks with 8 attributes and 5 alternatives. We also asked all respondents the same 3 holdout tasks and a 100 point attribute importance allocation question. Respondents were randomly assigned to 1 of 4 cells of a 2x2 design. The 2x2 design was comprised of 2 different stated non-attendance questions and 2 desirability scale rating questions. The following table shows the attributes and levels considered for this study.

Attribute             Levels
Attractiveness        Not Very Attractive; Somewhat Attractive; Very Attractive
Romantic/Passionate   Not Very Romantic/Passionate; Somewhat Romantic/Passionate; Very Romantic/Passionate
Honesty/Loyalty       Can't Trust; Mostly Trust; Completely Trust
Funny                 Not Funny; Sometimes Funny; Very Funny
Intelligence          Not Very Smart; Pretty Smart; Brilliant
Political Views       Strong Republican; Swing Voter; Strong Democrat
Religious Views       Christian; Religious - Not Christian; No Religion/Secular
Annual Income         $15,000; $40,000; $65,000; $100,000; $200,000

We wanted to test different ways of asking attribute non-attendance. Like our first study, we asked half the respondents which attributes they ignored during the choice tasks: "Which of these attributes, if any, did you ignore in making your choice about a significant other?" The other half of respondents were asked which attributes they used: "Which of these attributes did you use in making your choice about a significant other?" For each attribute in the choice study, we asked each respondent to rate all levels of the attribute on desirability. Half of the respondents were asked on a 1–5 scale, the other half on a 0–10 scale. Below are examples of these rating questions for the Attractiveness attribute.
1–5 Scale Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness? Completely Not Very Highly Extremely No Opinion/Not Desirable Unacceptable Desirable Desirable Desirable Relevant (1) (2) (3) (5) (4) Not Very Attractive Somewhat Attractive Very Attractive m m m m m m m m m m m m m m m m m m 0–10 Scale Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness? Not Very Attractive Somewhat Attractive Very Attractive Extremely Undesireable (0) (1) (2) (3) (4) (5) (6) (7) (8) m m m m m m m m m m m m m m m m m m m m m m m m m m m Extremely No Opinion/Not Desirable Relevant (10) (9) m m m m m m m m m As previously mentioned, all respondents were asked an attribute importance allocation question. This question was asked of all respondents so that we could use it as a base measure for comparisons. The question asked: “For each of the attributes shown, how important is it for your significant other to have the more desirable versus less desirable level? Please allocate 100 points among the attributes, with the most important attributes having the most points. If an attribute has no importance to you at all, it should be assigned an importance of 0.” We now compare the results for the different non-attendance methods. All methods easily identify Honesty/Loyalty as the most influential attribute. 86.8% of respondents asked if they used Honesty/Loyalty in making their choice marked they did. Annual Income is the least influential attribute for the stated questions, and second-to-least for the allocation question. Since 200 Choice, 1–5 Scale and 0–10 Scale are derived, they don’t show this same social bias, and Annual Income ranks differently. Choice looks at the average range of the attribute’s HB coefficients. Similarly, 1–5 and 0–10 Scale looks at the average range of the ranked levels. For the 0–10 Scale, the average range between “Can’t Trust” and “Completely Trust” is 8.7. Stated Used Honesty/Loyalty 86.8% Funny 53.7% Stated Ignored 8.1% 17.4% 1-5 Scale 3.44 2.11 0-10 Scale Allocation Choice 8.70 31.3% 5.93 5.42 10.0% 1.20 Intelligence 52.1% 17.0% 1.81 4.75 10.6% 1.24 48.7% 44.7% 39.3% 28.0% 20.1% 15.4% 25.6% 26.7% 35.9% 44.2% 2.06 1.81 0.88 0.33 1.88 5.12 4.18 2.12 1.11 4.22 12.5% 13.8% 9.7% 5.3% 6.8% 0.96 1.44 0.82 0.50 1.63 Attribute Rankings Stated Used Honesty/Loyalty 1 Funny 2 Intelligence 3 Romantic/Passionate 4 Attractiveness 5 Religious Views 6 Political Views 7 Annual Income 8 Stated Ignored 1 4 3 2 5 6 7 8 1-5 Scale 1 2 5 3 6 7 8 4 0-10 Scale Allocation Choice 1 1 1 2 5 5 4 4 4 3 3 6 6 2 3 7 6 7 8 8 8 5 7 2 Romantic/Passionate Attractiveness Religious Views Political Views Annual Income In addition to social bias, another deficiency with Stated Used and Stated Ignored questions is that some respondents don’t check all the boxes they should, thus understating usage of attributes for Stated Used, and overstating usage for Stated Ignored. The respondents, who were asked which attributes they used, selected on average 3.72 attributes. The average number of attributes that respondents said they ignored is 1.92, implying they used 6.08 of the 8 attributes. If respondents carefully considered these questions, and truthfully marked all that applied, the difference between Stated Used and Stated Ignored would be smaller. For each type of stated non-attendance question, we analyzed each attribute for all respondents and determined if the attribute was ignored or not. 
The following table shows the differences across the methods of identifying attribute non-attendance. For the allocation question, if the respondent allocated 0 of the 100 points, the attribute was considered ignored. For 1–5 Scale and 0–10 Scale if the range of the levels of the attribute was 0, the attribute was considered ignored. The table below shows percent discordance between methods. For example, the Allocate method identified 35.1% incorrectly from the Stated Used method. Overall, we see 201 the methods differ substantially one from another. The most diverse methods are the Empirical 10% and the Stated Used with 47.6% discordance. % Discordance Stated Stated Used Ignore Allocate 35.1% 24.9% Stated Used Stated Ignore 1-5 Scale 0-10 Scale 1-5 Scale 22.3% 39.3% 23.2% 0-10 Scale Emp 10% 21.3% 28.8% 41.6% 47.6% 22.2% 27.9% 23.6% 22.4% Accounting for these diverse methods of attribute non-attendance in our models, we can see the impact on holdout hit rates is small, and only the 1–5 Scale shows improvement over the model without stated non-attendance. Hit Rates No SNA Holdout 1 55.6% Holdout 2 57.1% Holdout 3 61.8% Average 58.2% Allocate 55.2% 58.1% 61.3% Stated Used 51.4% 54.1% 57.6% Stated Ignore 56.4% 54.9% 62.1% 1-5 Scale 58.8% 57.0% 61.5% 0-10 Scale 52.8% 56.5% 61.2% 58.2% 54.3% 57.8% 59.1% 56.8% When we account for attribute non-attendance empirically, using the method previously described, we see improvement in the holdouts. Again we see, in the table below, that as we increase the cut point, zeroing out more attributes, the holdouts increase. Instead of accounting for attribute non-attendance by asking additional questions we can efficiently do so empirically using the choice data. Hit Rates Cut Point None Holdout 58.2% Internal 91.0% 10% 58.7% 90.3% 20% 59.0% 86.5% 30% 62.4% 81.6% 40% 64.8% 76.8% 50% 67.0% 73.5% CONCLUSIONS When asked: “Please indicate which of the attributes, if any, you ignored when you made your choices in the preceding questions” respondents previously asked discrete choice questions of a partial profile type design, indicated they ignored fewer attributes than those asked full profile choice questions. Partial profile type designs solicit more attentive responses than full profile designs. This, we believe, is due to the fact that the attributes shown change, demanding more of the respondent’s attention. Aggregate and Hierarchical Bayes Models typically do not perform better when we account for stated attribute non-attendance. Accounting for questions directly asking which attributes were ignored performs better than asking which attributes were used. Derived methods eliminate the social bias of direct questions and, when accounted for in the models, tend to perform better than the direct questions. Respondents use a different thought process when answering stated 202 attribute non-attendance questions than choice tasks. Combing different question types pollutes the interpretation of the models and is discouraged. A simple empirical way to account for attribute non-attendance is to look at the HB utilities range and zero out attributes with relatively small ranges. Models where we identify nonattendance empirically perform better on holdouts and benefit from not needing additional questions. Dan Yardley PREVIOUS LITERATURE Alemu, M.H., M.R Morkbak, S.B. Olsen, and C.L Jensen (2011) “Attending to the reasons for attribute non-attendance in choice experiments,” FOI working paper, University of Copenhagen. Balcombe, K., M. Burton and D. 
Rigby (2011) “Skew and attribute non-attendance within the Bayesian mixed logit model,” paper presented at the International Choice Modeling Conference. Cameron, T.A. and J.R. DeShazo (2010) “Differential attention to attributes in utility-theoretic choice models,” Journal of Choice Modeling, 3: 73–115. Campbell, D., D.A. Hensher and R. Scarpa (2011) “Non-attendance to attributes in environmental choice analysis: a latent class specification,” Journal of Environmental Planning and Management, 54, 2061–1076. Hensher, D.A. and A.T. Collins (2011) “Interrogation of responses to stated choices experiments: is there sense in what respondents tell us?” Journal of Choice Modeling, 4: 62–89. Hensher, D.A. and W.H. Greene (2010) “Non-attendance and dual processing of common-metric attributes in choice analysis: A Latent Class Specification,” Empirical Economics, 39, 413– 426. Hensher, D.A. and J. Rose (2009) “Simplifying choice through attribute preservation or nonattendance: implications for willingness to pay,” Transportation Research E, 45, 583–590. Hensher, D.A., J. Rose and W.H. Greene (2005) “The implications on willingness to pay of respondents ignoring specific attributes,” Transportation, 32, 203–222. 203 Scarpa, R., T.J. Gilbride, D. Campbell, D. and D.A. Hensher (2009) “Modelling attribute nonattendance in choice experiments for rural landscape valuation,” European Review of Agricultural Economics, 36, 151–174. Scarpa, R., R. Raffaelli, S. Notaro and J. Louviere (2011) “Modelling the effects of stated attribute non-attendance on its inference: an application to visitors benefits from the alpine grazing commons,” paper presented at the International Choice Modeling Conference. 204 ANCHORED ADAPTIVE MAXDIFF: APPLICATION IN CONTINUOUS CONCEPT TEST ROSANNA MAU JANE TANG LEANN HELMRICH MAGGIE COURNOYER VISION CRITICAL SUMMARY Innovative firms with a large number of potential new products often set up continuous programs to test these concepts in waves as they are developed. The test program usually assesses these concepts using monadic or sequential monadic ratings. It is important that the results be comparable not just within each wave but across waves as well. The results of all the testing are used to build a normative database and select the best ideas for implementation. MaxDiff is superior to ratings, but is not well suited for tracking across the multiple waves of a continuous testing program. This can be addressed by using an Anchored Adaptive MaxDiff approach. The use of anchoring transforms relative preferences into an absolute scale, which is comparable across waves. Our results show that while there are strong consistencies between the two methodologies, the concepts are more clearly differentiated through their anchored MaxDiff scores. Concepts that were later proven to be successes also seemed to be more clearly identified using the Anchored approach. It is time to bring MaxDiff into the area of continuous concept testing. 1. INTRODUCTION Concept testing is one of the most commonly used tools for new product development. Firms with a large number of ideas to test usually have one or more continuous concept test programs. Rather than testing a large number of concepts in one go, a small group of concepts are developed and tested at regular intervals. Traditional monadic or sequential monadic concept testing methodology—based on rating scales—is well suited for this type of program. 
Respondents rate each concept one at a time on an absolute scale, and the results can be compared within each wave and across waves as well. Over time, the testing program builds up a normative database that is used to identify the best candidates for the next stage of development. To ensure that results are truly comparable across waves, the approach used in all of the waves must be consistent. Some of the most important components that must be monitored include: Study design—A sequential monadic set up is often used for this type of program. Each respondent should be exposed to a fixed number of concepts in each wave. The number of respondents seeing each concept should also be approximately the same. Sample design and qualification—The sample specification and qualifying criteria should be consistent between waves. The source of sample should also remain stable. River samples and router samples suffer from lack of control over sample composition and therefore are not suitable 205 for this purpose. Network samples, where the sample is made up of several different panel suppliers, need to be controlled so that similar proportions come from each panel supplier each time. Number and format of the concept tested—The number of concepts tested should be similar between waves. If a really large number of items need to be tested, new waves should be added. The concepts should be at about the same stage of concept development. The format of the concepts, for example, image with a text description, should also remain consistent. Questionnaire design and reporting—In a sequential monadic set-up, respondents are randomly assigned to evaluate a fixed number of concepts and they see one concept at a time. The order in which the concepts are presented is randomized. Respondents are asked to evaluate the concept using a scale to rate their “interest” or “likelihood of purchase.” In the reporting stage, the key reporting statistic used to assess the preference for concepts is determined. This could be, for example, the top 2 box ratings on a 5-point Likert scale of purchase interest. This reporting statistic, once determined, should remain consistent across waves. In this type of sequential monadic approach, each concept is tested independently. Over time, the key reporting statistics are compiled for all of the concepts tested. This allows for the establishment of norms and action standards—the determination of “how good is good?” Such standards are essential for identifying the best ideas for promotion to the next stage of product development. However, it is at this stage that difficulties often appear. A commonly encountered problem is rating scale statistics offer only a very small differentiation between concepts. If the difference between the second quartile and the top decile is 6%—while the margin of error is around 8%—how can we really tell if a concept is merely good, or if it is excellent? Traditional rating scales do not always allow us to clearly identify which concepts are truly preferred by potential customers. Another method is needed to establish respondent preference. MaxDiff seems to be an obvious choice. 2. MAXDIFF MaxDiff methodology has shown itself to be superior to ratings (Cohen 2003; Chrzan & Golovashkina 2006). Instead of using a scale, respondents are shown a subset of concepts and asked to choose the one they like best and the one they like least. The task is then repeated several times according to a statistical design. There is no scale-use bias. 
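As a minimal illustration of how much information those repeated best/worst picks carry, a simple best-minus-worst counting score can be computed as below. This is a common quick summary, not the hierarchical Bayes estimation used later in this paper, and the data layout and column names are made up.

```python
import pandas as pd

# Assumed long format: one row per respondent x task, recording the item picked
# as best and the item picked as worst in that task.
responses = pd.DataFrame({
    "respondent": [1, 1, 2, 2],
    "task":       [1, 2, 1, 2],
    "best":       ["A", "C", "B", "A"],
    "worst":      ["D", "B", "D", "C"],
})

best_counts = responses["best"].value_counts()
worst_counts = responses["worst"].value_counts()
# In a real study each count would also be divided by the number of times the
# item was shown, so items appearing in more tasks are not favored.
score = best_counts.sub(worst_counts, fill_value=0).sort_values(ascending=False)
print(score)   # A: 2, B: 0, C: 0, D: -2
```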
The tradeoffs respondents make during the MaxDiff exercise not only reveal their preference heterogeneity, but also provide much better differentiation between the concepts. However, MaxDiff results typically reveal only relative preferences—how well a concept performs relative to other concepts presented at the same time. The best-rated concept from one wave of testing might not, in fact, be any better than a concept that was rated as mediocre in a different wave of testing. How can we assure the marketing and product development teams that the best concept is truly a great concept rather than simply “the tallest pygmy”? Anchoring Converting the relative preferences from MaxDiff studies into absolute preference requires some form of anchoring. Several attempts have been made at achieving this. Orme (2009) tested 206 and reported the results from dual-response anchoring proposed by Jordan Louviere. After each MaxDiff question, an additional question was posed to the respondent: Considering just these four features . . . All four are important None of these four are important Some are important, some are not Because this additional question was posed after each MaxDiff task, it added significantly more time to the MaxDiff questionnaire. Lattery (2010) introduced the direct binary response (DBR) method. After all the MaxDiff tasks have been completed, respondents were presented with all the items in one list and asked to check all the items that appealed to them. This proved to be an effective way of revealing absolute preference—it was simpler and only required one additional question. Horne et al. (2012) confirmed these advantages of the direct binary method, but also demonstrated that it was subject to context effects; specifically, the number of items effect. If the number of items is different in two tests, for example, one has 20 items and another has 30 items, can we still compare the results? Making It Adaptive Putting aside the problem of anchoring for now, consider the MaxDiff experiment itself. MaxDiff exercises are repetitive. All the tasks follow the same layout, for example, picking the best and the worst amongst 4 options. The same task is repeated over and over again. The exercise can take a long time when a large number of items need to be tested. For example, if there are 20 items to test, it can take 12 to 15 tasks per respondent to collect enough information for modeling. As each item gets the same number of exposures, the “bad” items get just as much attention as the “good” items. Respondents are presented with subsets of items that seem random to them. They cannot see where the study is going (especially if they see something they have already rejected popping up again and again) and can become disengaged. The adaptive approach, first proposed by Bryan Orme in 2006, is one way to tackle this issue. Here is an example of a 20-item Adaptive MaxDiff exercise: The process works much like an athletic tournament. In stage 1, we start with 4 MaxDiff tasks with 5 items per task. The losers in stage 1 are discarded, leaving 16 items in stage 2. The losers are dropped out after each stage. By stage 4, 8 items are left, which are evaluated in 4 pairs. In the final stage, we ask the respondents to rank the 4 surviving items. Respondents can 207 see where the exercise is going. The tasks are different, and so less tedious; and respondents find the experience more enjoyable overall (Orme 2006). 
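As a concrete illustration of this tournament flow (not the authors' implementation), here is a minimal Python sketch in which a hypothetical respondent always keeps the higher-utility item and discards the lower one; a real exercise would record the respondent's actual best/worst picks. The stage at which each item is eliminated is retained because, as discussed below, it is useful when building the anchoring question.

import random

random.seed(7)

def adaptive_maxdiff(items, utilities):
    """Tournament-style Adaptive MaxDiff for 20 items: stages of 5, 4, 3 and
    2 items per task, dropping the 'worst' pick of each task, then a final
    ranking of the 4 survivors. Records the stage at which each item fell out."""
    survivors = list(items)
    eliminated_at = {}
    for stage, task_size in enumerate([5, 4, 3, 2], start=1):
        random.shuffle(survivors)
        tasks = [survivors[i:i + task_size]
                 for i in range(0, len(survivors), task_size)]
        losers = [min(task, key=lambda x: utilities[x]) for task in tasks]
        for item in losers:
            eliminated_at[item] = stage
        survivors = [x for x in survivors if x not in losers]
    ranking = sorted(survivors, key=lambda x: utilities[x], reverse=True)  # stage 5
    return ranking, eliminated_at

items = ["Item %d" % i for i in range(1, 21)]           # the 20-item example
utilities = {x: random.random() for x in items}         # hypothetical preferences
final_ranking, eliminated_at = adaptive_maxdiff(items, utilities)
print(final_ranking)   # the 4 surviving items, ranked in stage 5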
At the end of Adaptive MaxDiff exercise, the analyst has individual level responses showing which items are, and which are not, preferred. Because the better items are retained into the later stages and therefore have more exposures, the better items are measured more precisely than the less preferred items. This makes sense—often we want to focus on the winners. Finally, the results from Adaptive MaxDiff are consistent with the results of traditional MaxDiff (Orme 2006). Anchored Adaptive MaxDiff Anchored Adaptive MaxDiff combines the adaptive MaxDiff process with the direct binary anchoring approach. However, since the direct binary approach is sensitive to the number of items included in the anchoring question, we want to fix that number. For example, regardless of the number of concepts tested, the anchoring question may always display 6 items. Since this number is fixed, all waves must have at least that number of concepts for testing. Probably, most waves will have more. Which leads to this question: which items should be included in the anchoring question? To provide an anchor that is consistent between waves, we want to include items that span the entire spectrum from most preferred to least preferred. The Adaptive MaxDiff process gives us that. In the 20 item example shown previously, these items could be used in the anchoring: Ranked Best from Stage 5 Ranked Worst from Stage 5 Randomly select one of the discards from stage 4 Randomly select one of the discards from stage 3 Randomly select one of the discards from stage 2 Randomly select one of the discards from stage 1 None of the above The Anchored Adaptive MaxDiff approach has the following benefits: • • • • A more enjoyable respondent experience. More precise estimates for the more preferred items—which are the most important estimates. No “number of items” effect through the “controlled” anchoring question. Anchored (absolute) preference. The question remains whether the results from multiple Anchored Adaptive MaxDiff experiments are truly comparable to each other. 3. OUR EXPERIMENT While the actual concepts included in continuous testing programs vary, the format and the number of concepts tested are relatively stable. This allows us to set up the Adaptive MaxDiff exercise and the corresponding binary response anchoring question and use the same nearly identical structure for each wave of testing. 208 Ideally, the anchored MaxDiff methodology would be tested by comparing its results to an independent dataset of results obtained from scale ratings. That becomes very costly, especially given the multiple wave nature of the process we are interested in. To save money, we collected both sets of data at the same time. An Anchored Adaptive exercise was piggybacked onto the usual sequential monadic concept testing which contained the key purchase intent question. We tested a total of 126 concepts in 5 waves, approximately once every 6 months. The number of concepts tested in each wave ranged from 20 to 30. Respondents completed the Anchored Adaptive MaxDiff exercise first. They were then randomly shown 3 concepts for rating in a sequential monadic fashion. Each concept was rated on purchase intent, uniqueness, etc. Respondents were also asked to provide qualitative feedback on that concept. The overall sample size was set to ensure that each concept received 150 exposures. 
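As a rough check on the sample sizes shown in the table below: with 3 concepts rated per respondent, roughly (number of concepts x 150) / 3 respondents are needed per wave for the rating leg. This back-of-envelope calculation is ours, assumes an even rotation of concepts, and the fielded sizes are close to, but not exactly, these figures.

# Approximate respondents needed per wave so each concept gets ~150 rating
# exposures when every respondent rates 3 randomly assigned concepts.
concepts_per_wave = {1: 25, 2: 20, 3: 21, 4: 30, 5: 30}
for wave, n_concepts in concepts_per_wave.items():
    print("Wave %d: about %d respondents" % (wave, n_concepts * 150 // 3))
# Prints roughly 1250, 1000, 1050, 1500 and 1500 respondents for waves 1 to 5.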
Wave    Number of concepts   Field Dates    Sample Size   Notes
1       25                   Spring 2011    1,200
2       20                   Fall 2011      1,000
3       21                   Spring 2012    1,000
4       30                   Fall 2012      1,500         Including 5 known "star" concepts
5       30                   Spring 2013    1,500         With 2 "star" concepts repeated
Total   126                                 6,200

In wave 4, we included five known "star" concepts. These were concepts based on existing products with good sales history—they should test well. Two of the star concepts were then repeated in wave 5. Again, they should test well. More importantly, they should receive consistent results in both waves. The flow of the survey is shown in the following diagram:

The concept tests were always done sequentially, with MaxDiff first and the sequential monadic concept testing questions afterwards. Thus, the concept test results were never "pure and uncontaminated." It is possible the MaxDiff exercise might have influenced the concept test results. However, as MaxDiff exposed all the concepts to all the respondents, we introduced no systematic biases for any given concept tested.

The results of this experiment are shown here: The numbers on each line identify the concepts tested in that wave. Number 20 in wave 1 is different from number 20 in wave 2—they are different concepts with the same id number. The T2B lines are the top-2-box Purchase Intent, i.e., the proportion of respondents who rated the concept as "Extremely likely" or "Somewhat likely" to purchase, based on approximately n=150 per concept for the rating scale evaluation. The AA-MD lines are the Anchored Adaptive MaxDiff (AA-MD) results. We used Sawtooth Software CBC/HB in the MaxDiff estimation and the pairwise coding for the anchoring question as outlined in Horne et al. (2012). We plot the simulated probability of purchase for each concept, i.e., the exponentiated beta for the concept divided by the sum of the exponentiated beta for that concept and the exponentiated beta for the anchor threshold. The numbers are the average across respondents. They can be interpreted as the average likelihood of purchase for each concept.

While the Anchored Adaptive MaxDiff results are generally slightly lower than the T2B ratings, there are many similarities. The AA-MD results have better differentiation than the purchase intent ratings. Visually, there is less bunching together, especially among the better concepts tested. Note that while we used the T2B score here, we also did a comparison using the average weighted purchase intent. With a 5-point purchase intent scale, we used weighting factors of 0.7/0.3/0.1/0/0. The results were virtually identical. The correlation between T2B score and weighted purchase intent was 0.98; the weighted purchase intent numbers were generally lower.

Below is a scatter plot of AA-MD scores versus T2B ratings. There is excellent consistency between the two sets of numbers, suggesting that they are essentially measuring the same construct. Since T2B ratings can be used in aggregating results across waves, MaxDiff scores can be used for that as well. We should note the overlap of the concepts (in terms of preferences by either measure) among the 5 waves. All the waves have some good/preferred concepts and some bad/less preferred concepts. No wave stands out as being particularly good or bad. This is not surprising in a continuous concept testing environment as the ideas are simply tested as they come along without any prior manipulation. We can say that there is naturally occurring randomization in the process that divides the concepts into waves.
We would not expect to see one wave with only good concepts, and another with just bad ones. This may be one of the reasons why AA-MD works here. If the waves contain similarly worthy concepts, the importance of anchoring is diminished. To examine the role of the anchor, we reran the results without the anchoring question in the MaxDiff estimation. Anchoring mainly helps us to interpret the MaxDiff results in an “absolute” sense so that Adaptive Anchoring-MaxDiff scores represent the average simulated probability of purchase. Anchoring also helps marginally in improving consistency with the T2B purchase intent measure across waves. Alpha with T2B (%) Purchase Intent n=126 concepts AA MD scores with Anchoring Without Anchoring 0.94 0.91 We combined the results from all 126 concepts. The T2B line in the chart below is the top-2box Purchase Intent and the other is the AA-MD scores. The table to the right shows the distribution of the T2B ratings and the AA-MD scores. 211 These numbers are used to set “norms” or action standards—hundreds of concepts are tested; only a few go forward. There are many similarities between the results. With T2B ratings, there is only a six percentage point difference between something that is good and something truly outstanding, i.e., 80th to 95th percentile. The spread is much wider (a 12 point difference) with AA-MD scores. The adaptive nature of the MaxDiff exercise means that the worse performing concepts are less differentiated, but that is of little concern. The five “star” concepts all performed well in AA-MD, consistently showing up in the top third. The results were mixed in T2B purchase intent ratings, with one star concept (#4) slipping below the top one-third and one star concept (#5) falling into the bottom half. When the same two star concepts (#3 and #4) were repeated in wave 5, the AA-MD results were consistent between the two waves for both concepts, while the T2B ratings were more varied. Star concept #4’s T2B rating jumped 9 points in the percent rank from wave 4 to wave 5. While these results are not conclusive given that we only have 2 concepts, they are consistent with the type of results we expect from these approaches. 212 4. DISCUSSION As previously noted, concepts are usually tested as they come along, without prioritization. This naturally occurring randomization process may be one of the reasons the Anchored Adaptive MaxDiff methodology appears to be free from the context effect brought on by testing the concepts at different times, i.e., waves. Wirth & Wolfrath (2012) proposed Express MaxDiff to deal with large numbers of items. Express MaxDiff employs a controlled block design and utilizes HB’s borrowing strength mechanism to infer full individual parameter vectors. In an Express MaxDiff setting, respondents are randomly assigned into questionnaire versions, each of which deals with only a subset of items (allocated using an experimental design). Through simulation studies, the authors were able to recover aggregate estimates almost perfectly and satisfactorily predicted choices in holdout tasks. They also advised users to increase the number of prior degrees of freedom, which significantly improved parameter recovery. Another key finding from Wirth & Wolfrath (2012) is that the number of blocks used to create the subset has little impact in parameter recovery. However, increasing the number of items per block improves recovery of individual parameters. 
Taking this idea to the extreme, we can create an experiment where each item is shown in one and only one of the blocks (as few blocks as possible), and there are a fairly large number of items per block. This very much resembles our current experiment with 126 concepts divided into 5 blocks, each with 20–30 concepts. If we pretend our data came from an Express MD experiment (excluding data from anchoring), we can create an HB run with all 126 items together using all 6,200 respondents. Using a fairly strong prior (d.f. = 2000) to allow for borrowing across samples, the results correlate almost perfectly with what we obtained from each individual dataset. This again demonstrates the lack of any context effect due to the "wave" design. This also explains why we see only a marginal decline in consistency between AA-MD scores and T2B ratings when anchoring is excluded from the model.

5. "ABSOLUTE" COMPARISON & ANCHORING

While we are satisfied that Anchored Adaptive MaxDiff used in the continuous concept testing environment is indeed free from this context effect, can this methodology be used in other settings where such naturally occurring randomization does not apply? Several conference attendees asked us if Anchored Adaptive MaxDiff would work if all the concepts tested were known to be "good" concepts. While we have no definitive answer to this question and believe further research is needed, Orme (2009b) offers some insights into this problem.

Orme (2009b) looked at MaxDiff anchoring using, among other things, a 5-point rating scale. There were two waves of the experiment. In wave 1, 30 items were tested through a traditional MaxDiff experiment. Once that was completed, the 30 items were analyzed and ordered in a list from the most to the least preferred. The list was then divided into two halves, with the best 15 items in one and the worst 15 items in the other. In wave 2 of the study, respondents were randomly assigned into an Adaptive MaxDiff experiment using either the best 15 list or the worst 15 list. The author also made use of the natural order of a respondent's individual level preference expressed through his Adaptive MaxDiff results. In particular, after the MaxDiff tasks, each respondent was asked to rate, on a 5-point rating scale, the desirability of 5 items which consisted of the following:

Item 1: Item winning the Adaptive MaxDiff tournament
Item 2: Item not eliminated until the 4th Adaptive MaxDiff round
Item 3: Item not eliminated until the 3rd Adaptive MaxDiff round
Item 4: Item not eliminated until the 2nd Adaptive MaxDiff round
Item 5: Item eliminated in the 1st Adaptive MaxDiff round

Since those respondents with the worst 15 list saw only items less preferred, one would expect the average ratings they gave would be lower than those respondents who saw the best 15 list. However, that was not what the respondents did. Indeed the mean ratings for the winning item were essentially tied between the two groups.

                               Worst15 (N=115)   Best15 (N=96)
Winning MaxDiff Item           3.99              3.98
Item Eliminated in 1st Round   2.30              3.16

(Table 2—Orme 2009b)

This suggested that respondents were using the 5-pt rating scale in a "relative" manner, adjusting their ratings within the context of items seen in the questionnaire. This made it difficult to use the ratings data as an absolute measuring stick to calibrate the Wave 2 Adaptive MaxDiff scores and recover the pattern of scores seen from Wave 1. A further "shift" factor (quantity A below) is needed to align the best 15 and the worst 15 items.
(Figure 2—Orme 2009b) In another cell of the experiment, respondents were asked to volunteer (open-ended question) items that they would consider the best and worst in the context of the decision. Respondents were then asked to rate the same 5 items selected from their Adaptive MaxDiff exercise along with these two additional items on the 5-point desirability scale. 215 N=96 N=86 Worst15 Best15 Winning MaxDiff Item 3.64 3.94 Item Eliminated in 1st Round 2.13 2.95 (Table 3—Orme 2009b) (Figure 3—Orme 2009b) Interestingly, this time the anchor rating questions (for the 5-items based on preferences expressed in the Adaptive MaxDiff) yielded more differentiation between those who received the best 15 list and those with the worst 15 list. It seems that asking respondents to rate their own absolute best/worst provided a good frame of reference so that 5-pt rating scale could be used in a more “absolute” sense. This effect was carried through into the modeling. While there was still a shift in the MaxDiff scores from the two lists, the effects were much less pronounced. We are encouraged by this result. It makes sense that the success of anchoring is directly tied to how the respondents use the anchor mechanism. If respondents were using the anchoring mechanism in a more “absolute” sense, the anchored MaxDiff score would be more suitable as an “absolute” measure of preferences, and vice versa. Coincidentally, McCullough (2013) contained some data that showed that respondents do not react strictly in a “relative” fashion to direct binary anchoring. The author asked respondents to select all the brand image items that would describe a brand after a MaxDiff exercise about those brand image items and the brands. Respondents selected many more items for the two existing (and well-known) brands than for the new entry brand. 216 Average number of Brand Image Item Selected Brand #1 Brand #2 New Brand 4.9 4.2 2.8 Why are respondents selecting more items for the existing brands? Two issues could be at work here: 1. There is a brand-halo effect. Respondents identify more items for the existing brands simply because the brands themselves are well known. 2. A well-known brand would be known for more brand image items, and respondents are indeed making “absolute” judgment when it comes to what is related to that brand and are making more selections because of it. McCullough (2013) also created a Negative DBR cell where respondents were asked the direct binary response question for all the items that would not describe the brands. The author found that the addition of this negative information helped to minimize the scale-usage bias and removed the brand-halo effect. We see clearly now that existing brands are indeed associated with more brand image items. (McCullough 2013) While we cannot estimate the exact size of the effects due to brand-halo, we can conclude that the differences we observed in the number of items selected in the direct binary response question are at least partly due to more associations with existing brands. That is, respondents are making (at least partly) an “absolute” judgment in terms of what brand image items are associated with each of the brands. We are further encouraged by this result. To test our hypothesis that respondents make use of direct binary response questions in an “absolute” sense, we set out to collect additional data. 
We asked our colleagues from Angus Reid Global (a firm specializing in public opinion research) to come up with two lists: list A included 6 items that were known to be important to Canadians today (fall of 2013) and list B included 6 items that were known to be less important.

List A                      List B
Economy                     Aboriginal Affairs
Ethics / Accountability     Arctic Sovereignty
Health Care                 Foreign Aid
Unemployment                Immigration
Environment                 National Unity
Tax relief                  Promoting Canadian Tourism Abroad

We then showed a random sample of Angus Reid Forum panelists one of these lists (randomly assigned) and asked them to select the issues they felt were important. Respondents who saw the list with known important items clearly selected more items than those respondents who saw the list with known less important items.

Number of Items Identified as Important Out of 6 Items

                                      List A                 List B
                                      Most Important Items   Less Important Items
n=                                    505                    507
Mean                                  3.0                    1.8
Standard Deviation                    1.7                    1.4
p-value on the difference in means    <0.0001

This result is encouraging because it indicates that respondents are using the direct binary anchoring mechanism in an "absolute" sense. This bodes well for using Anchored Adaptive MaxDiff results for making "absolute" comparisons.

6. CONCLUSIONS

Many companies perform continuous concept tests using sequential monadic ratings. MaxDiff can be a superior approach, as it provides better differentiation between the concepts tested. Using Adaptive MaxDiff mitigates the downside of MaxDiff from the respondents' perspective, improving their engagement. The problem with using MaxDiff in a continuous testing environment is that MaxDiff results are relative, not absolute. Therefore, some form of anchoring is needed to compare results between testing waves. Our results demonstrate that using Adaptive MaxDiff with a direct-binary anchoring technique is a feasible solution to this problem. A promising field for further research is to examine waves that differ significantly in the quality of the concepts they are testing. However, we have some evidence that respondents use the direct binary anchoring in a suitably "absolute" sense. Given the importance of separating the wheat from the chaff at a relatively early (and inexpensive) stage of the product development process, we believe that this approach is an important step forward for market research.

Rosanna Mau
Jane Tang

REFERENCES

Chrzan, K. & Golovashkina, N. (2006), "An Empirical Test of Six Stated Importance Measures," International Journal of Market Research, Vol. 48, Issue 6.
Horne, J., Rayer, B., Baker, R., & Lenart, S. (2012), "Continued Investigation into the Role of the 'Anchor' in MaxDiff and Related Tradeoff Exercises," Sawtooth Software Conference Proceedings.
Lattery, K. (2011), "Anchoring Maximum Difference Scaling Against a Threshold—Dual Response and Direct Binary Responses," Sawtooth Software Technical Paper Library.
McCullough, P. R. (2013), "Brand Imagery Measurement: A New Approach," Sawtooth Software Conference Proceedings.
Orme, B. (2006), "Adaptive Maximum Difference Scaling," Sawtooth Software Technical Paper Library.
Orme, B. (2009), "Anchored Scaling in MaxDiff Using Dual Response," Sawtooth Software Technical Paper Library.
Orme, B. (2009b), "Using Calibration Questions to Obtain Absolute Scaling in MaxDiff," SKIM/Sawtooth Software Conference.
Wirth, R. & Wolfrath, A.
(2012) “Using MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of a New Approach” Sawtooth Software Conference Proceedings 219 HOW IMPORTANT ARE THE OBVIOUS COMPARISONS IN CBC? THE IMPACT OF REMOVING EASY CONJOINT TASKS PAUL JOHNSON WESTON HADLOCK SSI BACKGROUND Predicting consumer behavior with relatively high accuracy is a fundamental necessity for market researchers today. Multi-million dollar decisions are frequently based on the data collected and analyzed across a vast number of differing methodologies. One of the more prominent methods used today involves the study of choice via conjoint exercises (Orme, 2010). Conjoint analysis asks respondents to make trade-off decisions between different levels of attributes. For example, a respondent could be asked if they prefer to see a movie the evening of its release for $15 or wait a week later to see it for $10. However, many computer-generated conjoint designs will include some comparisons where you can get the best of both worlds, such as comparing opening night for $10 to a week later for $15. These easier comparisons contain a dominated concept where our prior expectations tell us that nobody is going to wait a week in order to pay $5 more. Choice tasks with dominated concepts are easy for respondents, but we learn little from them. The dominated concept is rarely selected because respondents do not need to sacrifice anything by avoiding it. Theoretically, showing these dominated concepts takes respondent time, without providing much additional benefit because we believe they will not select the dominated concept. Ideally each of the concepts presented should appeal to the average respondent about equally (similar utility scores), but just in different ways. Utility balancing will allow us to distinguish between the preferences of each individual rather than seeing everyone avoiding the obviously inferior concepts. Some efforts have been made to incorporate utility balancing in randomized conjoint designs. One option is to specify prior part-worth utilities for the products shown and define allowable ranges for utilities in a single task. This approach is rarely used because in most cases the exact magnitude of the expected utility of each attribute level is not known, so the a priori utilities of the products cannot be calculated. Prohibitions are another way to limit dominated concepts. Keith Chrzan and Bryan Orme saw theoretical efficiency gains when using corner prohibitions with simulated data, but prohibitions in general can be dangerous (Orme & Chrzan, 2000). Conditional and summed pricing are more effective ways to adjust price to avoid these dominated concepts by forcing an implicit tradeoff with price (Orme, 2007). While they are effective at reducing the number of dominated concepts found in a design, none of these techniques are 100% effective at making sure that they do not appear anywhere in the design. With these things in mind, we examine an alternative application to balancing the product utility in a conjoint task. While the exact magnitude of the a priori part-worth utilities are not normally known, it is common to know inside each attribute the a priori order of the part-worth utilities. We identify tasks containing dominated concepts (referred to as easy tasks) by comparing the attribute levels of products inside each task; if any product is equal to or superior 221 to another product on all attributes with an a priori order then it is considered an easy task. 
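The easy-task rule just described is simple to script. The sketch below is our own illustration (not the authors' spreadsheet formula): each concept is a dict of attribute codes, and for every attribute with an a priori order the codes are assumed to run from worst to best (price-like attributes coded in reverse), so weak dominance is just an element-wise comparison.

def dominates(a, b, ordered_attrs):
    """Concept a is equal to or better than concept b on every attribute
    that has an a priori preference order (exact ties are also flagged)."""
    return all(a[attr] >= b[attr] for attr in ordered_attrs)

def is_easy_task(concepts, ordered_attrs):
    """A task is 'easy' if any concept in it dominates another concept."""
    return any(dominates(a, b, ordered_attrs)
               for a in concepts for b in concepts if a is not b)

# Hypothetical coding: higher number = a priori better level; price reversed.
ordered_attrs = ["release_timing", "food_package", "seating", "price"]
task = [{"release_timing": 3, "food_package": 2, "seating": 2, "price": 5},
        {"release_timing": 1, "food_package": 2, "seating": 2, "price": 3}]
print(is_easy_task(task, ordered_attrs))   # True: the first concept dominates the second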
These easy tasks are then replaced with a new conjoint task. Theoretically, avoiding these dominated concepts in a standard randomized design would achieve similar hit-rate percentages with fewer tasks. However, we theorize based on our own experience that both designs will have similar hit rates. We also hypothesize that eliminating the easy tasks will increase the difficulty of the task and respondents will take more time to complete each task. Due to this increased difficulty, we thought the respondent experience would be negatively impacted. METHODS To maintain high incidence and control costs we sampled 500 respondents from SSI’s online panels after targeting for frequent movie attendance (at least once a month). The conjoint probed the preferences for different attributes in the movie theater experience ranging from the seating to the concessions included in a bundling option. We balanced on gender and age for general representativeness of the United States. Once respondents qualified for the study they were randomly shown one of two conjoint studies: balanced overlap design (control) or the same design with all easy tasks removed (treatment). We used the balanced overlap design as our control because it is the default of the Sawtooth Software and widely used in the industry. The attributes and levels tested in the design are shown below in the a priori preference from worst to best. Note that these prior expectations may not prove correct. For example we assumed that having one candy, one soda per person and a group popcorn bucket was better than just having one candy and one soda per person. However it could be that some people do not like popcorn and would actually prefer going without that additional item. When making these prior expectation assumptions it is important to keep in mind that preferences can be very heterogeneous and what might be a dominated concept for one person might not be a dominated concept for another. Table 1. Design Space Attribute New Release Food Included Seating Minimum Purchase 222 Levels Opening Day/Night Opening Week/Weekend After Opening Week 1 Candy 1 Soda per person Group popcorn bucket 1 Candy 1 Soda per person Group popcorn bucket 1 Candy per person No Package Provided Choose Your Own/Priority Seating General Admission Seating None 3 tickets 6 tickets Movie Type Drive Time Price 3D Standard 2D Fewer than 5 minutes 5–10 minutes 10–20 minutes 20–30 minutes Over 30 minutes $8.00 $8.50 $9.00 $9.50 $10.00 $10.50 $11.00 $11.50 $12.00 $12.50 $13.00 $13.50 $14.00 The treatment design used the balanced overlap design as a base. It had 300 versions with 8 tasks in each totaling 2,400 tasks. Each task showed four possible movie packages and a None option in a fifth column. After exporting the design into Microsoft Excel, we wrote a formula that searched for tasks with dominated concepts. It identified 562 easy tasks (23.4% of the total tasks) within the original design. We removed these tasks from the design and renumbered the versions and tasks to keep it consistent with eight tasks shown in each version. For example, if version 1 task 8 was an easy task we changed version 2 task 1 to the new version 1 task 8 and version 2 task 2 became the new version 2 task 1. After following this process for the entire design matrix, we were left with 229 complete versions for the treatment design. 
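The same removal-and-renumbering step can be expressed as a small script rather than a spreadsheet formula. This sketch reuses is_easy_task from the sketch above and assumes the design is a flat list of task records (field names are hypothetical): dropping 562 of the 2,400 tasks leaves 1,838, which repacks into 229 full versions of 8 tasks, with a handful of leftover tasks discarded.

def repack_design(design, ordered_attrs, tasks_per_version=8):
    """Drop easy tasks, then renumber the survivors into complete versions."""
    keep = [rec for rec in design
            if not is_easy_task(rec["concepts"], ordered_attrs)]
    full = (len(keep) // tasks_per_version) * tasks_per_version
    repacked = []
    for i, rec in enumerate(keep[:full]):
        version, task = divmod(i, tasks_per_version)
        repacked.append({"version": version + 1, "task": task + 1,
                         "concepts": rec["concepts"]})
    return repacked   # e.g., 1,838 kept tasks -> 229 versions of 8 tasks each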
We ran the diagnostics on both designs and they both had high efficiencies (the lowest was .977), so we felt comfortable that both designs were robust enough to handle the analysis. After the random tasks were shown to each respondent, we showed three in-sample holdout tasks. These holdout tasks mimicked the format of the random tasks with four movie options and a traditional none option, but they represented realistic purchasing scenarios that could be seen in the market without any dominated concepts. We used these tasks to measure hit rates under differing conditions. No out-of-sample holdout tasks were tested. We ran incremental HB utilities on each data set starting with one task and going to all eight tasks being included in the utility calculations. These utilities test how well each design performs using fewer tasks to predict the holdout tasks. We also examined how variable the average part-worth utilities were in each design as more tasks were added. Theoretically, the treatment design should have higher hit rates and more stable utilities with fewer tasks because it doesn’t waste time collecting information on the easy tasks. 223 At the end of the survey, respondents were asked two questions on a 5-point Likert scale about their survey experience: “How satisfied were you with your survey experience?” and “How interesting was the topic of the survey to you?” These are the standard questions SSI uses to monitor respondents’ reactions to surveys they complete. Lastly the time taken on each of the random tasks was recorded to see if the difficult tasks required more time. RESULTS Design Performance with Fewer Tasks The hit rate percentages were moderately higher in both designs as more tasks were included. The holdouts where none was selected were not excluded and were only counted as a hit if the model also predicted that they would select none. Because of the nature of the holdout tasks we would expect a hit rate somewhere between 20% (5 options with the none) and 25% (4 options without the none) just by chance. In each design the hit rate percentage does increase, showing that with more information in general we see better predictive performance. In one holdout task, we see a significant lift in the predictive performance of the treatment design (Figure 1). However, once we included at least five tasks in the utility calculations there is no difference between the two designs. Also, even with very few tasks we do not see a significant lift in the predictive performance of the treatment design on the second tasks (Figure 2). The third holdout was between the first and the third when there was an incremental, but not statistically significant lift in the holdout prediction with less than five tasks (Figure 3). While there is an indication that the treatment design can produce a slight gain in predictive ability it isn’t consistent in other holdout tasks. When we examined the average utilities, we looked to see if the average utilities were about the same with few tasks as they were with all 8 tasks. We found that for some attributes the control design estimated better with fewer tasks (meaning that the average utilities with only 4 tasks were closer to the average utilities with all 8 tasks), but for other attributes the treatment design estimated better. There was no significant improvement in the average utilities by the treatment design. 224 Figure 1. Holdout 1 Hit Rate Percentages by Design and Number of Tasks Figure 2. 
Holdout 2 Hit Rate Percentages by Design and Number of Tasks

Figure 3. Holdout 3 Hit Rate Percentages by Design and Number of Tasks

Design Completion Times

Overall both designs took comparable amounts of time to complete the task (Figure 4). The first tasks average around 45 seconds and by the time the respondents get the hang of the exercise they are taking tasks four to eight in a little under 20 seconds each. However, the pattern in the control group is odd. It doesn't follow the normal smooth trend of a learning curve. This can be explained by separating out the difficult and the easy tasks in the control design. The easy tasks in the control design take significantly less time when seen in the first through the fourth task (Figure 5). The combination of these two types of tasks could be what is producing the time spike on task three for the control group.

Figure 4. Completion Time by Task and Design Type

Figure 5. Completion Time by Task and Design Type

Design Satisfaction Scores

Respondents seemed not to mind the increased difficulty of the treatment design. In fact there are small indications that they enjoyed the more difficult survey. The top-two-box scores on both the satisfaction and interesting questions were 8% higher in the treatment design (Figure 6). However, when a significance test is done on the mean, the resulting t-test does not show a statistically significant increase in the overall mean (Table 2). Another possible explanation is that the utility balanced group had a slightly smaller percentage of respondents rating their survey experience (73% versus 78%), which could have either raised or lowered their satisfaction scores. In the end there might be some indications of a better survey experience, but not enough to conclude that the treatment design produced a better experience.

Figure 6. Satisfaction Ratings by Design Type

Table 2. T-test of Mean Satisfaction Ratings (t-Test: Two-Sample Assuming Unequal Variances)

How satisfied were you with your survey experience?
                      Control   Utility Balanced
Mean                  4.07      4.11
Variance              0.67      0.61
df                    377
t Stat                -0.53
P(T<=t) one-tail      0.30

How interesting was the topic of the survey to you?
                      Control   Utility Balanced
Mean                  4.02      4.13
Variance              0.71      0.66
df                    377
t Stat                -1.24
P(T<=t) one-tail      0.11

DISCUSSION AND CONCLUSIONS

In general the treatment design which removed dominated concepts performed about the same as the standard balanced overlap design. While there was a slight lift in the predicted holdouts when you have sparse information for the utilities (less than five tasks for this specific design space), both designs had similar predictive capabilities. The end results of the utilities were comparable at the aggregate level even with sparse information. The easy tasks clearly took less time to complete, but the difference went away as the respondents became accustomed to the task in general. Lastly, there seems to be slight evidence that removing dominated concepts can increase the interest and satisfaction in the survey, but not enough to statistically move the mean rating. There are other reasons for removing these dominated tasks. The most common is when a client encounters one and rightly questions the reason for making these comparisons, which seem trivial and obvious to them. This feedback suggests another possibility of simply not showing dominated concepts. Hiding the dominated concepts in standard choice-based conjoint tasks would essentially automatically code the dominated concept as inferior to whatever concept the respondent selected.
Future research needs to be done on the effects of this type of automated design adjustment. In this instance the a priori assumptions were largely correct. Dominated concepts were chosen on average 6% of the time they were shown (one fourth of the rate you would expect to see by chance). They were still selected sometimes which can indicate people who have more noise in their utilities or people who might even buck the trend and actually prefer inferior levels. For example, there could be a significant number of people who would prefer a 2D movie to a 3D movie. Imposing these assumptions and removing dominated concepts is a risk that is taken should the assumptions not be correct. Paul Johnson Weston Hadlock REFERENCES Burdein, I. (2013). Shorter isn’t always better. CASRO Online Conference. San Francisco, CA. Orme, B. (2007). Three ways to treat overall price in conjoint analysis. Sawtooth Software Research Paper Series, Retrieved from http://www.sawtoothsoftware.com/download/techpap/price3ways.pdf Orme, B. (2010). Getting started with conjoint analysis: Strategies for product design and pricing research. (2nd ed.). Madison, WI: Research Publishers LLC. Orme, B., & Chrzan, K. (2000). An overview and comparison of design strategies for choicebased conjoint analysis. Sawtooth Software Research Paper Series, Retrieved from http://www.sawtoothsoftware.com/download/techpap/desgncbc.pdf 229 SEGMENTING CHOICE AND NON-CHOICE DATA SIMULTANEOUSLY THOMAS C. EAGLE EAGLE ANALYTICS OF CALIFORNIA Segmenting choice and non-choice data simultaneously refers to segmentation that combines the multiple records inherent in a choice model estimation data set (i.e., the choices across multiple tasks) with the single observation related to the non-choice data (e.g., behaviors, attitudes, ratings, etc.) and derives segments using both data simultaneously. This is different from conducting a sequential approach to segmenting these data (i.e., fitting a set of individuallevel utilities first and then combining those with the non-choice data). It is also different from repeating the non-choice data across the choice tasks data. The method can account for the repeated nature of the choice data and the single observation of the non-choice data within a latent class modeling framework. Latent Gold’s advanced syntax module enables one to conduct such an analysis. The original intent of this paper was to show how a researcher could conduct a segmentation using both choice and non-choice data simultaneously without using a sequential approach: that is, fitting a Hierarchical Bayes MNL model first, combining the resulting individual level attribute utilities with the non-choice data, and subjecting that to a segmentation method such as K-Means or Latent Class modeling. Over time, the objective evolved into demonstrating why segmenting derived HB MNL utilities can be problematic. If the objective of the research dictates the use of a method designed specifically for that objective, then clearly one should use that method. However, in some cases the method has to be adapted. The situation of segmenting choice and non-choice data has been problematic because it there has not been a simple way to perform this simultaneously. Most practitioners, myself included, have resorted to fitting HB MNL models first, adding the non-choice data to the resulting utilities, and then conducting the segmentation analyses, simply because that seemed to be the only way to go. 
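For concreteness, the sequential workflow that this paper goes on to critique would look roughly like the following sketch (Python with scikit-learn; the variable names and the rescaling constant are our own illustrative choices, not the author's code):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def sequential_segmentation(hb_utils, nonchoice, n_segments=4, target_range=100.0):
    """Step 1: rescale each respondent's HB point-estimate utilities to a common range.
    Step 2: append standardized non-choice variables (attitudes, behaviors, ...).
    Step 3: cluster the stacked basis (K-means stands in for any clustering
    routine applied to the combined data)."""
    hb_utils = np.asarray(hb_utils, dtype=float)
    spread = hb_utils.max(axis=1, keepdims=True) - hb_utils.min(axis=1, keepdims=True)
    rescaled = hb_utils / np.where(spread == 0, 1.0, spread) * target_range
    basis = np.hstack([rescaled, StandardScaler().fit_transform(nonchoice)])
    return KMeans(n_clusters=n_segments, n_init=10, random_state=0).fit_predict(basis)

Note that the choice of target_range directly controls how much weight the utilities carry relative to the standardized non-choice variables, which is one of the issues the paper raises about this sequential approach.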
In addition to discussing why that approach is suspect, this paper demonstrates how to perform the segmentation properly within a latent class framework. The paper uses a simple contrived simulation data set to show why sequentially segmenting HB utilities with non-choice data has issues. We also show how to perform the analyses within a latent class framework without resorting to fitting HB utilities first. A real-world example is also briefly discussed in this paper. WHY WORRY ABOUT USING HB DERIVED CHOICE UTILITIES IN SEGMENTATION? Individual-level utilities derived from an HB MNL model are continuous, random variables with values determined by the priors and the data themselves. Using Sawtooth Software (and most other standard software) to fit the HB MNL model, the priors assume a multivariate normal variance-covariance matrix. The means of the individual-level utilities are determined by a multivariate regression model (the “upper model”) that assumes this multivariate normality and, in the simplest case, a single intercept to model the mean. Adding covariates to the upper model enables the model to estimate different upper-level means given the covariate pattern. Default settings for the degrees of freedom, the prior variance, the variance-covariance matrix, and the 231 upper-level means generally assume we do not know much about these things—they are “uninformative” priors. All this leads to utilities that tend to regress to the population mean, are continuously distributed across respondents, and can mask or blur genuine gaps in the distribution of the true parameters that may underlie the data. Without knowledge to appropriately set the priors we have the potential to see a smoothing of the utilities that can be problematic for segmentation. Another problem in the use of HB utilities in segmentation is that we are using mean utilities as if they are fixed point estimates of the respondents’ utilities. The very nature of the HB model is that the mean utilities are means of a “posterior distribution” reflecting our uncertainty about the respondent’s true utilities. Using the point estimates as though they are fixed removes the very uncertainty that Bayesian models so nicely capture and account for. Should we do that when conducting segmentation? Another issue is that, as Sawtooth Software recommends, and as this paper will show, one should rescale the derived HB utilities before using the sequential approach. However, the degree of rescaling can affect the segmentation results in terms of the derived number of optimal segments. The reason for rescaling is that the utilities of one respondent cannot be directly compared to another respondent because of scale differences (i.e., the amount of uncertainty in a respondent’s choices). This paper will show not only that rescaling is required even with data where the amount of uncertainty is exactly the same across all respondents, but also that the degree of rescaling may affect the results. A final issue is that segmenting derived HB utilities in a K-means-like clustering, hierarchical clustering, or latent class clustering method which assumes no model structure to the segmentation is not the same thing as conducting segmentation where an explicit model structure, such as a choice model, is imposed. If one’s objective is segmenting a sample of respondents on the basis of a choice model, why use a method that does not employ that model structure to segment the respondents? 
Why resort to using a method that is not model-based when such a model (a latent class choice model, in particular) needs fewer assumptions about the distributions of the data than does the typical HB choice model used by practitioners? If one's objective is segmentation, the question is, why would one want to subject the data to any of this if they do not have to do so?

LATENT CLASS SEGMENTATION OF BOTH CHOICE AND NON-CHOICE DATA

Latent Gold's advanced syntax module has the capabilities of conducting the segmentation analyses using choice and non-choice data. Another software package, MPLUS (available from Statmodel.com), has these capabilities when one is able to "trick" the program to handle the multinomial logit model (MPLUS does not have the built-in capability to fit the classic McFadden MNL model). The key to conducting this simultaneous segmenting of choice and non-choice data is in the general capabilities of Latent Gold, the syntax itself, and the structure of the data file. Figure 1 depicts a snippet of the Latent Gold syntax.

Figure 1: A snippet of code from Latent Gold choice syntax

The dashed boxes identify the syntax that deals with the non-choice variables. The choice modeling syntax is bolded. A more complete example with the appropriate data structure is given in the Appendix.

SIMULATION EXAMPLES

To examine the impact of segmenting HB derived choice utilities we build a simple simulation of a MaxDiff task. The MaxDiff task consists of 6 items shown in a balanced incomplete block design (BIBD) of 12 tasks with 3 items appearing in each task. Each item is seen 6 times and each is seen with every other item equally often. We create 4 segments with the known utilities shown in Table 1.

Table 1: Actual segment utilities used to generate individual utilities

         Seg 1   Seg 2   Seg 3   Seg 4
Item 1     2      -4      -2       4
Item 2     4       2      -4      -2
Item 3    -4      -2       4       2
Item 4    -2       4       2      -4
Item 5     3       1      -3      -1
Item 6     0       0       0       0
Total    100     100     100     100

(Seg 3 is Seg 1 flipped; Seg 4 is Seg 2 flipped.)

The utilities for 100 respondents per segment were constructed from the above by adding, for each item, a univariate normal deviate with mean 0. The standard deviation used for the normal deviate varied as described in the cases below. The raw choices for each task for each respondent were generated using each respondent's generated utilities and adding a Gumbel error with a scale of 1.0 (i.e., the error associated with each total utility for a task was -ln(-ln(1 - uniform draw[0,1])) * 1.0 (the Gumbel error scale) for the best choice and +ln(-ln(1 - uniform draw[0,1])) * 1.0 for the worst choice). These data were generated using SAS. The non-choice data was generated using the simulation capabilities of MPLUS. The following non-choice data were generated for 4 segments (Table 2).

Table 2: Non-Choice Distribution Across and Within Segments

        Size   X1=0   X1=1   X2=0   X2=1   X3=0   X3=1   Att1 mean (7-pt)   Att2 mean (7-pt)   Att3 mean (10-pt)
Seg 1    95    94%     6%    95%     5%    96%     4%    6.4                1.9                3.0
Seg 2   107     6%    94%    93%     7%    53%    47%    2.6                5.4                8.0
Seg 3   103    95%     5%     3%    97%     4%    96%    2.4                1.9                8.1
Seg 4    95    96%     4%     3%    97%    49%    51%    5.5                5.4                3.1

Several simulations (Cases) are discussed in the paper:

1) Highly Differentiated—Raw: A simulation where the actual choice data is highly concentrated within a segment, but highly differentiated across segments, and where the non-choice data is NOT included. The standard deviation used to generate the known utilities was 0.33.
2) Highly Differentiated—Aligned: The same choice data as #1 above, but with the nonchoice data mapped one-to-one to the choice data. That is, segment 1 data for the nonchoice data is aligned to segment 1 of the choice data. 3) Highly Differentiated—Random: The same choice data as #1 above, but the non-choice data is randomly assigned to the choice data. In other words, the choice data and nonchoice data do not share any common underlying segment structure in this case. 4) Less Differentiated—Aligned: Data where the generated choice utilities are not as concentrated within segments and less differentiated across segments as a result. The standard deviation used to generate the known utilities was 1.414. The non-choice data is aligned to the choice segments as in Case 2. Figure 2 below depicts the actual individual-level choice utilities generated for the Highly Differentiated simulations (they were the same for Cases 1–3 above). Figure 2 is for items 1 and 2 only, but plots for other pairs of items showed a similar pattern of highly differentiated segments of respondents. Figure 3 (also below) depicts the individual-level utilities generated for the Less Differentiated (Case 4 above) simulation. Again, other pairs of the items show similar patterns in the individual-level utilities. 234 Figure 2: Generated Individual Utilities and Segment Centroids for Items 1 & 2— Highly Differentiated ESTIMATION We used Sawtooth Software’s CBC HB program to estimate the individual-level utilities using their recommended settings for fitting MaxDiff choice data. These include: a prior variance of 2; prior degrees of freedom of 5; and a prior variance-covariance matrix of 2 on the diagonals and 1 on the off-diagonal elements. We used 50,000 burn-in iterations and our posterior mean utilities are derived from 10,000 iterations subsequent to the burn-in iterations. 235 Figure 3: Generated Individual Utilities and Segment Centroids for Items 1 & 2— Less Differentiated We used Latent Gold 4.5 to segment all simulation data sets. We used the Latent Gold syntax within the choice model advanced module to simultaneously segment the choice and non-choice data. When segmenting the derived HB utilities we used Latent Gold’s latent class clustering routine. This is equivalent to using K-means when all variables are continuous (see Magidson & Vermunt, 2002). We investigate three ways of estimating and rescaling the HB utilities: 1) raw utilities as derived from using the Sawtooth Software recommended settings for fitting MaxDiff data; 2) utilities derived by setting the prior degrees of freedom so high as to “pound” the utilities into submission (“hammering the priors”); and 3) rescaling the raw (unhammered) utilities to have a range from the maximum to minimum utility of 10 (Sawtooth Software recommends a value of 100). It must be noted that latent class clustering is not the same as latent class choice modeling. Latent class clustering is equivalent to trying to minimize a distance metric among observations within a segment while maximizing the differentiation between segments. It is applied to the estimated HB choice utilities like it would be to any other set of numbers, without any knowledge of what they are or where they came from. Latent class choice modeling imposes a model structure on the creation of segments, in this case, the MNL model itself. It finds a set of segments that maximizes the differences in utilities across segments given the model structure. 
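The choice-generation recipe in the Simulation Examples section (segment utility plus a mean-zero normal deviate per item, then Gumbel errors of scale 1.0 for the best and worst picks) can be reproduced in a few lines. This is our own Python illustration of that recipe, not the authors' SAS code; best and worst are forced to differ, which the paper does not spell out.

import math
import random

random.seed(42)

def gumbel_draw():
    """Standard Gumbel error via the inverse CDF: -ln(-ln(1 - U))."""
    return -math.log(-math.log(1.0 - random.random()))

def simulate_best_worst(ind_utils, task_items, scale=1.0):
    """Pick the 'best' item (utility + Gumbel error) and the 'worst' item
    (utility - Gumbel error) for one MaxDiff task."""
    noisy_best = {i: ind_utils[i] + gumbel_draw() * scale for i in task_items}
    best = max(noisy_best, key=noisy_best.get)
    rest = [i for i in task_items if i != best]
    noisy_worst = {i: ind_utils[i] - gumbel_draw() * scale for i in rest}
    worst = min(noisy_worst, key=noisy_worst.get)
    return best, worst

# Segment 1 utilities from Table 1, plus N(0, 0.33) noise (the Highly Differentiated cases).
seg1 = {"Item 1": 2, "Item 2": 4, "Item 3": -4, "Item 4": -2, "Item 5": 3, "Item 6": 0}
ind_utils = {item: u + random.gauss(0, 0.33) for item, u in seg1.items()}
print(simulate_best_worst(ind_utils, ["Item 1", "Item 2", "Item 3"]))   # one 3-item task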
236 Because latent class clustering has no model structure equivalent to latent class choice modeling, this leads to differences in the results. CASE 1: HIGHLY DIFFERENTIATED—RAW RESULTS Using the usual BIC (Bayesian Information Criterion) to determine the best number of segments, the optimal solution for the latent class MNL model produced 4 segments. The centroids of the derived segments are almost exactly the same as those used to generate the data. Because the locations are hidden in Figure 4, the actual coordinates are given in Table 3. Figure 4 depicts the actual MNL utilities, the derived HB MNL utilities, the centroids of the 4 true segments, the centroids for the 4 segment latent class clustering using the derived HB utilities, and the centroids of the BIC-optimal 10 segments from the latent class clustering, all for items 1 & 2. The larger black plus signs are the centroids of the latent class choice model segments (these are not clearly visible so their values are provided in Table 3). The larger black stars are the centroids for the BIC-optimal 10 segments from the latent class clustering on the derived HB utilities. The smaller black crosses represent the locations of the HB derived individual-level utilities. The smaller gray dots are the true underlying utilities. Figure 4: Results from the latent class clustering of the derived HB utilities 237 Table 3: Derived segment utilities (centroids) for latent class MNL model Actual Seg 1 Seg 2 Seg 3 Seg 4 Item 1 1.9 -3.7 -1.9 4.0 Item 2 3.5 2.0 -3.9 -1.7 Item 3 -4.1 -1.9 4.0 2.0 Item 4 -2.1 3.8 1.9 -3.7 Item 5 2.8 1.0 -2.9 -.8 Item 6 0 0 0 0 Total 100 100 100 100 There are two things to notice in Figure 4: 1) there is an increase in the variance, or spread, of the HB derived utilities (small black crosses) compared to the actual utilities (small gray dots); and 2) correspondingly, the optimal 10 latent class clustering segment centroids (large black stars) have spread out away from actual segment centroids (large black plus signs). Also plotted are the centroids of the 4 segment solution from the latent class clustering on the HB derived utilities (large black crosses). Interestingly, that 4 class solution produces perfect classification of the respondents into the 4 underlying true segments (large black plus signs). And, the general location of the 4 latent class clustering centroids is closer to the actual segments than are the centroids of the BIC-optimal 10 segment solution. However, the researcher working with this data would have no way of knowing the correct number of segments and would likely begin their evaluation with far more segments than what actually generated these data. There are several reasons why we see these results. The derived HB utilities are not scaled into the same space as the actual utilities.1 This is likely the result of using the uninformative priors that most practitioners assume when fitting MaxDiff or any choice model within HB. There are also no upper model covariates trying to differentiate the segments. Finally, there is a qualitative difference in the modeling approach. Latent class clustering is not the same thing as latent class choice modeling which has the MNL model embedded in its derivation of segments. CASE 2: HIGHLY DIFFERENTIATED—ALIGNED RESULTS The latent class choice model segmentation produced the 4 known segments almost exactly. The results are not different from Case 1. 
The location and spread of the derived HB utilities are exactly the same between the Case 1: Raw simulation and those of the Case 2: Aligned simulation (choice and non-choice data)—as you might expect given the HB derived utilities are the same. The latent class clustering of the HB derived utilities with the non-choice data produced a BIC-optimal 8 segments. The addition of the non-choice data did improve the previous BIC-optimal solution. The new solution collapsed 2 of the previous segments because of the non-choice data. These results are NOT shown in a Figure because the graphics do not reveal any additional insights than what we saw in Case 1. The latent class clustering with the non-choice data aligned one-to-one with the choice data strengthens the 4 segment pattern of differentiation created purposively in the data generation process by reducing the number of BIC optimal segments. As a result, the next set of results have fewer optimal segments than using the derived HB utilities alone. All further Highly 1 In this data we could fit individual-level MNL MaxDiff models using classical aggregate MNL estimation. The utilities derived from this estimation are far more extreme than those we see from the HB MNL estimation! This is indicative of the regression to the mean properties of the HB MNL model. The author can provide these estimates upon request. 238 Differentiated results, including those examining the impact of different ways of handling the derived HB utilities, are presented using the Case 2: Aligned with its non-choice data included. To reduce the spread of the HB derived utilities we increased the prior degrees of freedom from the recommended value of 5 to 400, the same value as the sample size. This “hammering the priors” should reduce the spread in the HB derived utilities. Figure 5 shows that the spread was indeed reduced. The number of optimal segments in the latent class clustering solution fell from 10 to 5 and the segment centroids were much closer to the latent class MNL choice model with aligned non-choice data segmentation centroids. Again, the 4 segment solution was spot on in classification. The only difference now is that the 5 HB-based BIC-optimal segments split segment 3 into 2 segments differing only with respect to extremes. A researcher’s examination of the mean segment utilities would likely drive the research towards the 4 segment solution. Figure 5: Highly Differentiated—Aligned; Priors Hammered The problem with this result is that the researcher does not know when to “hammer the priors.” Should we always do so? In our simulations, we have the unfair advantage of knowing the actual utilities—a real-world practitioner does not. In addition, the hope that the non-choice data is aligned so perfectly with the choice data, as is the case in Figure 5, is unrealistic. In the real-world it is more likely that there is less alignment seen. Another approach to segmenting choice utilities is to rescale the utilities. Several methods of rescaling exist, including rescaling the utilities so that the range in utilities for each individual 239 respondent is a constant (e.g., Sawtooth Software recommends a range of 100), or exponeniating the utilities and rescaling them to a function of the number of items in each task. In Figure 6 below the utilities have been rescaled to have a constant range of 10, which is more in line with the actual utilities. In order to make the actual utilities comparable on the graphic, they were also rescaled to the same range of 10. 
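The range rescaling mentioned here is a one-line operation per respondent. The sketch below is our own illustration: each respondent's utility vector is scaled so that its max-minus-min spread equals a chosen constant (10 here, 100 in the Sawtooth Software recommendation), which removes respondent-level scale differences but, as noted next, leaves the choice of constant to the analyst.

import numpy as np

def rescale_to_range(utils, target_range=10.0):
    """Rescale each row (respondent) so that max(u) - min(u) == target_range."""
    utils = np.asarray(utils, dtype=float)
    spread = utils.max(axis=1, keepdims=True) - utils.min(axis=1, keepdims=True)
    spread = np.where(spread == 0, 1.0, spread)   # guard against flat respondents
    return utils / spread * target_range

# Two respondents with the same preference pattern but different scale
# collapse onto identical rows after rescaling.
u = np.array([[2.0, 4.0, -4.0, -2.0, 3.0, 0.0],
              [4.0, 8.0, -8.0, -4.0, 6.0, 0.0]])
print(rescale_to_range(u))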
Keep in mind, however, that the practitioner would not know to what range value to rescale. The plots of the derived HB utilities are almost identical to the rescaled actual utilities. There is a stretching of the points along different axes, but this is due to the nature of the simulation's generation of these utilities across all 6 items. Selecting different actual utilities would have led to a more concentric pattern of utilities.

Figure 6: Highly Differentiated—Aligned; Rescaled Utilities (Range = 10)

These results have 7 segments based upon the BIC.2 The segments cluster around the 4 actual segment centroids and only differentiate extremes within the clustering of derived HB utilities. Ideally the researcher would notice this and decide to reduce the number of segments to the 4 obvious segments. As noted earlier, the aligned non-choice data is helping drive the number of segments lower than in the choice-only data segmentation we saw earlier. However, the rescaled HB utilities are resulting in segments that are differentiated only in terms of extremes (i.e., the difference among the segments is only in terms of 1 or 2 items having a higher mean utility). This might be exacerbated if we had rescaled the utilities to a range of 100, as that would increase their effect on the clustering distance measures vs. the non-choice data. Rescaling the utilities is important, but the rescaling value can significantly influence the number of segments.

2 Note the actual or predicted LC segments are not shown. Their locations have not substantially changed from Figure 5.

Profiling the segments on the choice and non-choice data makes this pattern clear. Figure 7 depicts the rescaled mean segment utilities for the 4 segment latent class choice model and for the 7 segment latent clustering of the rescaled HB derived utilities. Clearly, the HB segments are separated only by extreme rescaled utilities on the 1 or 2 items that differentiate them. A similar pattern arises in the comparison of the non-choice profiles as seen in Figures 8 and 9 below.

Figure 7: Highly Differentiated—Aligned; Rescaled Mean Segment Utilities

Figure 8: Highly Differentiated—Aligned; Profile of Nominal Non-Choice Variables

Figure 9: Highly Differentiated—Aligned; Profile of Continuous Non-Choice Variables

The boxes surrounding pairs of segments in the latent clustered HB-rescaled segmentation are those most similar to each other. Examination of these segments reveals that the segments paired together only differ on the magnitude of 1 or 2 non-choice variables and 1 or 2 MaxDiff items.

CASE 3: HIGHLY DIFFERENTIATED—RANDOM RESULTS

In this simulation we randomly assigned the non-choice observations to the choice data observations to see what would happen to the segmentation results. Both the latent class choice model results and the latent class clustering of the HB derived utilities results suggest more segments than before. The optimal number of segments for the latent class choice model segmentation is 17, while the latent clustering of derived HB utilities suggests 12. In the latent class choice model, all but one of the original 4 segments split into 4 smaller segments, differentiated by the non-choice data (one segment split into 5 segments accounting for a few outlier respondents differentiated by the utilities). In the clustered derived HB utilities segmentation, 3 of the 4 original segments split into 3 sub-segments, while one split into 4 segments.
These segments are also differentiated on the non-choice data, but not as perfectly as the latent class choice model segments. Tables 4 and 5 show the cross tabulations of the original utility generated segments by the BIC-optimal number of segments derived. Table 4: Latent class choice model segment cross-tabulation (Optimal solution as columns; original as rows); Highly differentiated—Random Orig Segs Seg 1 Seg 2 Seg 3 Seg 4 Seg 1 25% 18% 26% 31% Seg 5 Seg 6 Seg 7 Seg 8 25% 25% 24% 26% Seg 2 Seg 9 Seg 3 Seg 10 Seg 11 Seg 12 Seg 13 Seg 14 Seg 15 Seg 16 Seg 17 20% 19% 32% 29% Seg 4 Total 6% 5% 7% 8% 6% 6% 6% 7% 5% 5% 8% 7% 6% 23% 26% 25% 20% 2% 6% 7% 6% 5% Table 5: Derived HB utilities latent cluster segmentation cross-tabulation (Optimal as solution columns; original as rows); Highly differentiated—Random Orig Segs Seg 1 Seg 2 Seg 3 Seg 1 51% 31% 18% Seg 2 Seg 4 Seg 5 Seg 6 50% 25% 25% Seg 3 Seg 7 Seg 8 Seg 9 50% 48% 2% Seg 4 13% Size 8% 5% 12% 6% 6% 12% 12% Seg 10 Seg 11 Seg 12 1% 51% 25% 23% 1% 13% 6% 6% This contrived pattern of results suggests the possible use of a joint basis, or joint objective, latent class segmentation. Joint basis latent class segmentation would use two nominally scaled latent variables for producing segmentation on two different sets of variables, in this case the choice data and the non-choice data. This is equivalent to the joint segmentation work originally proposed by Ramaswamy, Chatterjee and Cohen (1996) who simultaneously segment different sets of basis variables allowing either for independent or dependent segments to be derived (by allowing correlations among the nominally scaled latent variables). This is simple to accomplish within the syntax framework of Latent Gold. Table 6 below depicts the cross-tabulation of results from the joint latent class choice model segmentation using the choice and non-choice data as separate nominal scaled latent variables. The pattern of segment assignment is nearly perfect. Table 6: Latent class choice model using joint segmentation (choice and non-choice data— 2 latent variables) Predicted w/ 2 LVs Known Non-Choice Segs Non-Choice Segs Choice Segs Seg 1 Seg 2 Seg 3 Seg 4 Total Choice Segs Seg 1 Seg 2 Seg 3 Seg 4 Total Seg 1 25 18 31 26 100 Seg 1 25 18 31 26 100 Seg 2 26 25 25 24 100 Seg 2 26 24 26 24 100 Seg 3 29 32 20 19 100 Seg 3 29 31 21 19 100 Seg 4 20 26 23 31 100 Seg 4 20 27 22 31 100 Total 100 101 99 100 400 Total 100 100 100 100 400 243 CASE 4: LESS DIFFERENTIATED—ALIGNED RESULTS Figure 10 depicts the segment centroids for both the latent class choice model segmentation and the derived HB utilities latent class clustering. In neither case was the actual number of 4 segments suggested as the optimal solution. Latent class choice modeling’s optimal solution consists of 9 segments, whereas the derived HB utilities latent class clustering yielded 7 segments. Table 7 shows how the actual known 4 segment solution breaks out across the two derived solutions. 
Figure 10: Less Differentiated—Aligned; Derived segment centroids

Table 7: Less Differentiated—Aligned; Cross-tabulations of both segmentations against actual segments

Latent Class Choice Model (non-choice segments as columns)
Orig Segs  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7  Seg 8  Seg 9
Seg 1       40%    60%
Seg 2                     59%    41%
Seg 3                                   48%    52%
Seg 4                                                 39%    27%    35%
Total        40     60     59     41     48     52     39     27     35

Latent Clustering with Derived HB Utilities (non-choice segments as columns)
Orig Segs  Seg 1  Seg 2  Seg 3  Seg 4  Seg 5  Seg 6  Seg 7
Seg 1       76%    24%
Seg 2                     57%    43%
Seg 3                                  100%
Seg 4                                          54%    46%
Total        76     24     57     43    100     54     46

Without going into the details of each segment's profile, the general conclusions that could be drawn are that the less differentiated the choice data is: 1) the more segments either approach will give us, 2) the distinctions between the segments become murkier and differences are only in the extremes, and 3) the non-choice data begins to more strongly influence the solution. With these results, it is hard to make a concrete recommendation to use latent class choice modeling incorporating non-choice data over the sequential approach of combining derived HB utilities with non-choice data. Issues are raised, but the evidence is not clear. Conceptually, if one's goal is segmentation, one should use segmentation methods without resorting to the sequential HB approach. However, this last simulation, which is more likely the situation we would see using real world data, suggests that either approach might work.

REAL-WORLD EXAMPLE

A latent class choice model using non-choice data was estimated for a segmentation project in which customers of international duty free shops were surveyed. The objective of the research was to develop a global segmentation that would enable the suppliers of duty free shops to tailor their products to the customers frequenting the shops. The client expected a large number of segments, of which only a few might be of direct interest to a specific supplier (e.g., tobacco, liquors, or electronics). The segmentation bases included actual shopping behaviors, benefits derived from shopping (MaxDiff task 1), the desire for promotions (MaxDiff task 2), and attitudes towards shopping. An online survey was conducted among 3,433 travelers recruited at 28 international airports around the world who had actually purchased something in a duty free shop on their last trip. The questionnaire included the two MaxDiff tasks, and a set of bi-polar semantic differential scale questions regarding attitudes towards shopping in general and duty-free shopping in particular. Behavioral items included the types of trip (e.g., business vs. leisure; economy vs. first class), the frequency of trips, their spending at duty-free shops, and socio-demographics.

A simultaneous, single latent variable, latent class choice model which included the non-choice data was estimated. A batch file to determine solutions with 2 to 25 segments was created and took over 2 days to run. Many different runs were made with direct client involvement in the evaluation of the interim solutions. Variables were dropped, added and transformed to tailor the solution and drive toward a solution the client felt was actionable. This process took over 3 weeks. In the final run, the BIC-optimal configuration had 13 segments; the client chose the 14 segment solution as the best from a business standpoint. Figures 11, 12, and 13 summarize the final set of segments.
Figure 11: Real World Example Segment Summaries—Part 1

Figure 12: Real World Example Segment Summaries—Part 2

Figure 13: Real World Example Segment Summaries—Part 3

DISCUSSION

Segmentation is hard. One must be very careful when conducting segmentation. I strongly believe that if the main objective of the research is segmentation, one should use segmentation methods. Prior to the development of Latent Gold it was difficult for the practitioner to simultaneously segment respondents on the basis of choice and non-choice data. Practitioners could either:

Segment solely using the choice data (preferably with a latent class MNL model) and then profile the resulting segments on the non-choice data of interest;

Separately segment the choice and non-choice data and bring the results together using either ensemble techniques or by adding one segmentation assignment to the other subsequent segmentation as an additional variable; or

Derive MNL utility estimates using HB methods and add the non-choice data to them before performing segmentation analyses.

This paper demonstrated issues that arise using the third approach. The ideal approach is to conduct a simultaneous segmentation using both choice and non-choice data together. Latent Gold's syntax enables the practitioner to conduct such analyses.3 While we recommend simultaneously segmenting choice and non-choice data using latent class methods when the objective is purely segmentation, we do not wish to suggest segmenting derived HB utilities is "wrong." Rather, our goal was to show how using derived HB utilities within segmentation without careful consideration of issues such as the setting of priors and the magnitude of rescaling the utilities can affect the results one sees. But these issues are avoided when conducting the segmentation simultaneously.

3 In addition to the Appendix of the paper, the author's website, www.eagleanalytics.com, provides some examples from the appendix of the presentation that show how to set up such a segmentation, as well as the original presentation slides.

We also demonstrated that when one takes the approach of simultaneously segmenting choice and non-choice data via Latent Gold's syntax module, one can extend the segmentation in several ways: 1) A joint segmentation on more than one set of basis variables can be performed. Using more than one nominally scaled latent variable and allowing the correlation among the latent variables to be estimated is the classic case of a joint basis (or joint objective) segmentation; 2) One may include more than one choice task in the segmentation. The real-world example segmented 2 MaxDiff tasks and non-choice data simultaneously into a common set of segments described by a single nominally scaled latent variable; and 3) Although not demonstrated here, one may also extend the segmentation by including MNL scale segments. Previous Sawtooth Software presentations (Magidson & Vermunt, 2007) have shown the segmentation approach to estimating relative scale differences among segments of respondents that are appropriately estimated according to the MNL model. These scale segments may or may not be allowed to be correlated with the other nominally scaled latent variables derived.

This paper and presentation was: 1) Not a bake-off between methods. 2) Not a software recommendation.
With some effort, similar analyses may be conducted within MPLUS; 3) Not about using ensemble analysis to evaluate multiple solutions (which could obviously be applied to either case); and, lastly, 4) The recommendation to use the simultaneous approach is based upon satisfying the objectives of a segmentation using a segmentation method that does not require sequential estimation, not the naïve simulations presented above. When clear patterns of differentiation exist in data most any segmentation method will find those patterns. However, when such clear differentiation does not exist every segmentation method will find some solution—a solution influenced by the method itself and its characteristics. This is the essence of one result in the research conducted by Dolnicar and Leisch (2010) who state: If natural clusters exist in the data, the choice of the clustering algorithm is less critical . . . If clusters are constructed, the choice of the clustering algorithms is critical, because each algorithm will impose structure on the data in a different way and, in so doing, affect the resulting cluster solution. 248 APPENDIX Example of complete syntax for a single latent variable simultaneous choice/non-choice model segmentation. 249 Thomas Eagle REFERENCES Dolnicar, S. and F. Leisch, (2010) “Evaluation of structure and reproducibility of cluster solutions using the bootstrap,” Marketing Letters, Volume 21 (1), pp 83–101. Magidson, J and J. Vermunt, (2002) “Latent class models for clustering: A comparison with Kmeans,” Canadian Journal of Marketing Research, Volume 20, pp 37–44. Magidson, J and J. Vermunt, (2007) “Removing the Scale Factor Confound in Multinomial Logit Choice Models to Obtain Better Estimates of Preference,” 2007 Sawtooth Software Conference Proceedings, pp 139–154. Ramaswamy, V.R., R. Chatterjee and S.H. Cohen, (1996) “Joint segmentation on distinct interdependent bases with categorical data,” Journal of Market Research, Volume 33 (August), pp 337–55. 250 EXTENDING CLUSTER ENSEMBLE ANALYSIS VIA SEMI-SUPERVISED LEARNING EWA NOWAKOWSKA1 GFK CUSTOM RESEARCH NORTH AMERICA JOSEPH RETZER CMI RESEARCH, INC. INTRODUCTION Market segmentation analysis is perhaps the most challenging task market researchers face. This is not necessarily due to algorithmic or computational complexity, data availability, lack of methodological approaches, etc. The issue instead is that desirable market segmentation solutions simultaneously address two issues: Partition quality and Facilitation of market strategy (actionability). Various approaches to market segmentation analysis have attempted to address the multi-goal issue described above (see, for example, “Having Your Cake and Eating It Too? Approaches for Attitudinally Insightful and Targetable Segmentations” (Diener and Jones, 2009)). This paper describes a completely new and effective approach to the challenge. Specifically we extend cluster ensemble methodology by augmenting the ensemble partitions with those derived from a supervised learning (Random Forest—RF) predictive model. The RF partitions incorporate profiling information indicative of target measures that are most of interest to the marketing manager. Segmentation membership is more easily identified on the basis of previously selected attributes, behaviors, etc. of interest. Also, the consensus solution produced from the ensemble is of high quality facilitating differentiation across segments. 
1 Ewa Nowakowska, 8401 Golden Valley Rd, Minneapolis, MN 55427, USA, T: 515 441 0006 | [email protected] Joseph Retzer, CMI Research Inc., 2299 Perimeter Park Drive, Atlanta, Georgia 30341, USA. T: 678 805 4013 | [email protected] 251 I. MARKET SEGMENTATION CHALLENGES Figure 1.1 Standard Deck of Cards Consider an ordinary deck of 52 playing cards shown in Figure 1.1. To illustrate the challenges of market segmentation we may begin by asking the question: “What is a high quality partition that might be formed from the deck?” First, a few definitions: Partition: In market segmentation a partition is nothing more than an identifier of group membership, e.g., a column (vector) of numbers, one entry for each respondent in our dataset, which identifies the segment to which our respondent is assigned. High Quality: A high quality partition is one in which the groups identified are similar (homogeneous) within a specific group while differing (heterogeneous) across groups. A number of potential, arguably high quality, partitions may come to mind: Red vs. Black cards (2 segment partition) Face vs. Numeric cards (2 segment partition) Clubs vs. Diamonds vs. Spades vs. Hearts (4 segment partition) Aces vs. Kings vs. Queens vs. . . . (13 segment partition) Etc. Given multiple potential partitions, the question becomes “Which one should we use?” In the case of the example above we might answer, “It depends on the game you intend to play.” More generally, the market researcher would respond, “We choose the partition that best facilitates marketing strategy.” Addressing the fundamental and nontrivial dilemma illustrated by this simple example is the focus of this paper. II. HIGH QUALITY AND ACTIONABLE RESULTS As noted earlier, high quality cluster solutions exhibit high intra-cluster similarity and low inter-cluster similarity. Achieving high quality clusters is the primary focus of most, if not all, commonly used clustering methods. In addition, numerous measures of quality also exist e.g., 252 Dunn Index Hubert’s Gamma Silhouette plot/value Etc. All of the above are in some way attempting to simultaneously reflect homogeneity within and heterogeneity between segments. While various algorithms perform reasonably well in achieving cluster quality, a new, computationally intensive ensemble approach, cluster ensemble analysis, has been shown to produce high quality results on both synthetic and actual data. Cluster ensembles also offer numerous additional advantages when applied to multi-faceted data (see Strehl and Ghosh (2002), Retzer and Shan (2007)). The second desirable outcome of a segmentation model, actionable results, is often more difficult to achieve. The marketing manager cannot implement effective marketing strategy without being able to predict membership in relevant groups, i.e., the segments must be actionable. Actionability focuses on customer differences and similarities that best drive marketing strategies. Marketing strategies might include such activities as: Developing brand messages appealing to different customer types, Designing effective ad materials, Driving sales force tactics and training, Targeted marketing campaigns, and Speaking or reacting to concerns of most desirable customers. It is important to note that actionability is not necessarily achieved by insuring a high quality partition. Consider an illustration in which a biostatistician produces a genome-based partition that perfectly segments gene sequences identifying hair color. 
Assume the resultant clusters are made up of, for example: Cluster 1: Blond hair respondents Cluster 2: Brown hair respondents Cluster 3: Red hair respondents Etc. Further assume that there is no overlap in hair colors across the clusters. A question we might ask is “does this represent a desirable partition?” Clearly the answer would be “no” if in fact the biostatistician was engaged in cancer research. It is worth noting that occasionally researchers may be tempted to simply include customer demographics (or other actionable information) directly into the segmentation as basis variables. This is undesirable for a number of reasons including: • • There is no reason to necessarily expect groups to form in such a way as to facilitate identification of respondent membership in desired clusters. This may or may not happen since we are focused only on simultaneous relationships between attributes as opposed to relationships directly related to the prediction of a “market strategy facilitating” (e.g., purchasers vs. non-purchasers) outcome. The introduction of additional/many dimensions in a segmentation analysis, which requires simultaneous similarity across ALL dimensions in the dataset, typically leads to a relatively flat (similar) cluster solution of poor quality. 253 In order to ensure the inclusion of information that is directly related to the prediction of relevant market strategy variables (e.g., Purchaser vs. Non-purchaser), we turn to an analytical approach well suited to prediction, Random Forest analysis. III. MACHINE LEARNING TERMINOLOGY Before going forward into more detail around the specifics involved in Semi-Supervised Learning, a brief aside defining some commonly used machine-learning terminology is called for. We provide a brief description of Unsupervised, Supervised and Semi-Supervised Learning below. Unsupervised Learning: Unsupervised learning is a process by which we learn about the data without being supervised by the knowledge of respondent groupings. The data is referred to as “unlabeled,” implying the groupings, whatever they may be, are latent. Common cluster analysis algorithms are examples of unsupervised learning. There are many examples of such algorithms including, e.g., • • • • Hierarchical methods Non-hierarchical: k-means, Partitioning Around Medoids (PAM) etc. Model Based: Finite Mixture Models Etc. Supervised Learning: Supervised learning on the other hand, is the process in which we learn about the data while being supervised by the knowledge of the groups to which our respondents belong. In supervised learning external labels are provided (e.g., purchaser vs. nonpurchaser). Supervised learning algorithms include such as methodologies as, e.g., • • • • • Classification And Regression Trees (CART) Support Vector Machines (SVM) Neural Networks (NN) Random Forests (RF) Etc. Semi-Supervised Learning: Semi-Supervised Learning involves learning about data where complete or partial group membership in market strategy facilitating segments is known. In essence semi-supervised learning combines aspects of both unsupervised and supervised learning. The implementation of semi-supervised learning models may be performed in a variety of ways. An example is found in “A Genetic Algorithm Approach for Semi-Supervised Clustering” by Demiriz, Bennett and Embrechts. 
While this illustration employs partially labeled data, requiring an additional intermediate step to assign labels to all data points, the underlying approach is conceptually the same as is found in this presentation. Our implementation of semisupervised learning is achieved via cluster ensemble analysis. IV. CLUSTER ENSEMBLES (UNSUPERVISED LEARNING) Cluster ensembles or consensus clustering analysis is a computationally intense data mining technique representing a recent advance in unsupervised learning analysis (see Strehl and Gosh (2002)). 254 Cluster Ensemble Analysis (CEA) begins by generating multiple cluster solutions using a collection of “base learner” algorithms (e.g., PAM (Partitioning around Medoids)), finite mixture models, k-means, etc.). It next derives a “consensus” solution that is expected to be more robust and of higher quality than any of the individual ensemble members used to create it. Cluster Ensemble solutions exhibit low sensitivity to noise, outliers and sampling variations. CEA effectively detects and portrays meaningful clusters that cannot be identified as easily with any individual technique. CEA has been suggested as a generic approach for improving the quality and stability of base clustering algorithm results (high quality cluster solutions exhibit high intracluster similarity and low inter-cluster similarity). CEA provides a framework for incorporating numerous unsupervised learning ensemble members (generated via PAM, k-means, hierarchical, etc.) as well as augmentation of the unsupervised ensemble with partitions reflecting supervised learning analysis. V. RANDOM FOREST ANALYSIS (SUPERVISED LEARNING) The supervised learning information incorporated into the ensemble is provided by Random Forest (RF) analysis. RF analysis is a tree-based, supervised learning approach suggested by Breiman (2001). It represents an extension of “bagging” (bootstrap aggregation), also suggested by Breiman (1996). RF’s are algorithmically similar to bagging, however additional randomness is injected into the set of estimated trees by evaluating, at each node, a subset of potential split variables selected randomly from the eligible group. An intuitive description of RF analysis may be given as follows. Consider illustration 5.1 below, Figure 5.1 Random Forests Just as a single tree may be combined with many trees to create a forest, a single decision tree may be combined with many decision trees to create a statistical model referred to as a Random Forest. Categorical variable prediction in RF analysis is accomplished through simple majority vote across all forest tree members. That is to say a respondents data may be “dropped” through each of the RF trees and a count made as to the number of times the individual is classified in any of the categories (by virtue of the respondent position in each tree’s terminal 255 nodes). Whichever category the respondent falls into most often is the winner and the respondent is subsequently classified in that group. Empirical evidence suggests Random Forests have higher predictive accuracy than a single tree while not over-fitting the data. This may be attributed to the random predictor selection mitigating the effect of variable correlations in conjunction with predictive strength derived from estimating multiple un-pruned trees. RF’s offer numerous advantages over other supervised learning approaches, not the least of which is markedly superior out-of-sample prediction. 
Other advantages of RF analysis include: Handles mixed variable types well. Is invariant to monotonic transformations of the input variables. Is robust to outlying observations. Accommodates several strategies for dealing with missing data. Easily deals with a large number of variables due to its intrinsic variable selection. Facilitates the creation of a respondent similarity matrix. VI. COMBINING SUPERVISED WITH UNSUPERVISED LEARNING (SEMI-SUPERVISED LEARNING) While it is clear that RF analysis is well suited to supervised learning and predictive analysis, what may be less clear is how can it be used in the context of cluster ensembles to facilitate semi-supervised learning. This process is described below. As a first step, an n x n (where n is the total number of respondents in our study) null matrix is created, T = total number of trees. Next, for each tree in the RF analysis (typically around 500 trees are created) holdout observations are passed through the tree and respondent pairs are observed in terms of being together or otherwise, in the trees’ terminal nodes. Specifically, 256 For each of T trees, if respondent i and respondent j (i ≠ j) both land in the same terminal node, increase the i, jth element by 1. The final matrix, SRF,, is a count for every possible respondent pair, of the number of times each landed in the same terminal node. The resultant matrix then may be considered a similarity matrix of respondents based on their predictive information pertaining to a selected marketing strategy facilitating measures. Similarity matrices may be used to create cluster partitions. These partitions in turn may be included in ensembles that also contain partitions based on purely unsupervised learning analysis. It is in this way we are able to combine latent with observed group membership information to produce a semi-supervised learning consensus partition. It is important to note that we employed Sawtooth Software’s Convergent Cluster Ensemble Analysis (CCEA) package to arrive at a consensus clustering. Not only does this facilitate the creation of an unsupervised ensemble but in addition adds the stability afforded by a convergent solution into the final result. VII. SHOWCASE 1. The data and the task This section presents the results obtained when applying the proposed method to real data. The data described general attitudes to technology with special focus on mobile phones. The primary goal was to see how the respondents segment on attitudes towards mobile phones, so the attitudinal statements were considered the basis variables. An accurate cluster assignment tool was also required, however, it was supposed to be based not on the category-specific statements such as the attitudes but on more general questions such as lifestyle. Such general questions can be asked in virtually any questionnaire, making the segments detectable regardless of the specific subject of the survey. Hence, the basis and future predictive variables differed. Last but not least, the segments were expected to profile well on behavioral variables, which in this case was mobile phone usage, of key importance for segments’ understanding. This block of statements is later referred to as additional profiling variables. Figure 7.1 shows the summary of the data structure. Figure 7.1 The data structure As noted earlier, there are several reasons why inserting all these variables alike in a standard segmenting procedure is likely to be unsuccessful. 
In addition, for this data set, behavioral variables tend to dominate the segmenting process, producing very distinct yet hardly operational partitions. Also, long lists of variables are typically not recommended, as the distinctive power spreads roughly evenly among multiple items, resulting in solutions with flat and hardly interpretable profiles.

2. The analysis

The analysis was conducted in three steps, where the first two can be considered parallel. First, Sawtooth Software's Convergent Cluster Ensemble Analysis (CCEA) was performed, taking the base variables as the input. In this case default settings for ensemble analysis were used. This produced not only the CCEA consensus solution, which was later used for comparisons, but also a whole ensemble of partitions, which constitutes one of the two pillars of the method. Second, Random Forests (RF) were used to predict the additional profiling variables with the set of predictive variables. In this case, mobile phone usage was explained by lifestyle statements. For each profiling variable there was a single Random Forest, and each produced a single similarity matrix, also referred to as a proximity matrix. In this case there were multiple additional profiling variables of interest, hence there were also multiple similarity matrices. Each similarity matrix was then used to partition the observations. For each of them, combinatorial (Partitioning Around Medoids (PAM)) as well as hierarchical (average linkage) clustering algorithms were used, and the number of clusters was varied within the range of interest, which was from 3 to 7 groups. The diversity was desired, as typically more diverse ensembles lead to richer and more stable consensus solutions. Altogether, this produced a large ensemble of RF-based partitions, which were then merged with the original CCEA ensemble. The Convergent Cluster Ensemble Analysis was performed again, now producing the semi-supervised CCEA consensus solution. The overview of the process is presented in Figure 7.2.

Figure 7.2 The analytical process

The Random Forest analysis was programmed in R, primarily with the 'randomForest' package. The ensemble clustering was done with Sawtooth Software's Convergent Cluster Ensemble Analysis, which facilitated the whole process to a great extent. Ensemble clustering can also be done in R with the 'clue' package, which offers a flexible and extensible computational environment for creating and analyzing cluster ensembles; however, most of the process must be explicitly specified by the user, making it more demanding and time consuming. In Sawtooth Software's CCEA the user chooses between standard CCA (Convergent Cluster Analysis, via K-means) and CCEA (Convergent Cluster Ensemble Analysis). Selecting the latter, one needs to make the key decision of how the ensemble should be constructed. It can be done in the course of the analysis, which we exploited in the first step of the process, but it can also be obtained from outside sources, which we used in the last step, extending the standard CCEA ensemble by the RF-based partitions. The ultimate consensus solution was selected based on the quality report including reproducibility statistics.
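The second step can be sketched in a few lines of R with the 'randomForest' package mentioned above. The object names (lifestyle for the predictive statements, usage_var for one profiling variable) are assumptions for illustration; the actual study script is not reproduced here.

library(randomForest)
library(cluster)   # pam()

set.seed(2013)
# 'lifestyle' is an assumed data frame of the predictive (lifestyle) statements;
# 'usage_var' is one additional profiling (usage) variable coded as a factor.
rf <- randomForest(x = lifestyle, y = usage_var, ntree = 500,
                   proximity = TRUE, oob.prox = TRUE)

# Turn the RF proximity (similarity) matrix into a dissimilarity matrix.
d <- as.dist(1 - rf$proximity)

# Build a diverse set of RF-based partitions: PAM and average-linkage
# hierarchical clustering, each with 3 to 7 clusters.
rf_partitions <- list()
hc <- hclust(d, method = "average")
for (k in 3:7) {
  rf_partitions[[paste0("pam_", k)]]  <- pam(d, k = k)$clustering
  rf_partitions[[paste0("havg_", k)]] <- cutree(hc, k = k)
}
# One such set is produced per profiling variable; all of these partitions are
# appended to the CCEA ensemble before the consensus step is run again.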
3. The results

Quality comparison. The semi-supervised CCEA (SS-CCEA) consensus solution was compared to the standard attitude-based CCEA consensus solution obtained in the first step of the analysis. The reported reproducibility equaled 72% for the standard CCEA and 86% for its semi-supervised extension SS-CCEA, which was a relatively high result for both groupings given the number of clusters and statements. The difference in favor of SS-CCEA is possibly due to its more diverse and larger ensemble; however, it tells us that extending the ensemble by the RF-based partitions tends to increase reproducibility rather than spoiling what was originally a good CCEA solution.

Figure 7.3 Transition matrix (rows: CCEA segments; columns: SS-CCEA segments)

          SS-CCEA 1  SS-CCEA 2  SS-CCEA 3  SS-CCEA 4  SS-CCEA 5
CCEA 1        448          6         44        158         64
CCEA 2          6        432         67         25         80
CCEA 3         13        125        383         50         99
CCEA 4        111          8        131        337         47
CCEA 5         60         77        192        103        215

Attitudinal (basis) variables. The matrix in Figure 7.3 is a transition matrix, which shows the size of the intersection between the original CCEA and the SS-CCEA segments. The possible issue of label switching was taken care of, and hence the diagonal represents the observations that remained in the original segments after the semi-supervised modification. One can see that the diagonal dominates, which indicates that both solutions have very much in common. Therefore there are grounds to believe the semi-supervised part only slightly modified the original CCEA solution rather than producing an entirely new partition.
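A small sketch of how such a transition matrix can be built and its labels aligned in R, using the 'clue' package mentioned earlier; the vectors ccea and ss_ccea holding the two segment assignments are assumed names.

library(clue)   # solve_LSAP() for the optimal (Hungarian) label assignment

# 'ccea' and 'ss_ccea' are assumed vectors of segment assignments per respondent.
trans <- table(CCEA = ccea, SS_CCEA = ss_ccea)   # segment intersections, as in Figure 7.3

# Deal with label switching: permute the SS-CCEA labels so the diagonal
# (respondents that stayed in "their" segment) is as large as possible.
perm <- solve_LSAP(trans, maximum = TRUE)
trans_aligned <- trans[, as.integer(perm)]

trans_aligned                                   # aligned transition matrix
sum(diag(trans_aligned)) / sum(trans_aligned)   # share of respondents that did not move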
Figure 7.4 Absolute profiles for CCEA and SS-CCEA solutions (radial plots of segment means over the attitudinal statements att_1 to att_16, with shades indicating segments)

This is confirmed by how similarly the segments profile on the basis (attitudinal) variables. This can be observed in the radial plots in Figure 7.4 showing the absolute profiles, with shades indicating segments. Hence altogether, given that the CCEA consensus solution was of high quality, the SS-CCEA is of high quality as well.

Usage (additional profiling) variables. Let us now compare the behavior of the additional profiling variables, which was the category usage the segments were expected to profile on. To measure this, we used Friedman's importance (relevance) as described in the work of Friedman and Meulman (2004). Intuitively, for each given variable and each given segment, it can be thought of as the ratio of the variance over all observations with respect to the variance within the segment. Technically, the coefficient is defined in terms of spread, which is more general than variance, and also uses an additional normalizing factor, but the idea remains. So, the larger the value, the more important the variable and the better the grouping profiles on it.

Figure 7.5 Additional profiling variables: Friedman's importance (one bar chart per segment, Cluster 1 through Cluster 5)

The charts of Figure 7.5 show the values of the coefficient for the most important attributes in the segments. Each chart corresponds to a segment and each bar to an attribute. The most important attributes differ across the segments, which is indicated by varied shades of the corresponding bars. The light sub-bars stacked on top indicate the increase in importance due to the semi-supervised part. So, for instance, the first attribute in the second segment is very important and the RFs have virtually no effect on that. However, the same attribute in the third cluster is also most important, but in this case the RFs still increase its importance substantially. Generally, we can observe that the semi-supervised modification increased the importance of some of the attributes, so altogether the SS-CCEA solution profiles better on usage than the standard CCEA partition did.

Lifestyle (predictive) variables. Finally, let us examine the predictive performance of the extended SS-CCEA segmentation, which is the ability to assign each newly arriving observation to its segment with sufficient accuracy. To build the cluster assignment model, the RF approach was used again, so another Random Forest was built to predict cluster membership. To assess its performance, the RF's intrinsic mechanism for unbiased error estimation was exploited. In a Random Forest each tree is constructed using a different bootstrap sample, so the cases that are left out—the holdout sample, here called the out-of-bag (OOB) sample—are put down the tree for classification and used to estimate the error rate. This is called the OOB error rate.

Figure 7.6 OOB error rates

Figure 7.6 presents the OOB error rates for both solutions as well as their change with the increase in the number of trees in the forest. On the x-axis the number of trees in the forest is given, varying from 0 to 100. The y-axis shows the OOB error rates, the j-th element showing the result for the forest up to the j-th tree. The chart not only shows that in terms of the classification error rates the SS-CCEA tends to outperform the standard CCEA, but it also reminds us of the role of the size of the forest, as up to a certain point the error rates drop substantially with the increase in the number of trees. In brackets the OOB error rates for the entire forest (i.e., 100 trees) are given. This means that for SS-CCEA the cluster membership can be predicted with almost 85% accuracy, which is a very good result.

What often happens, though, is that although the overall classification error rate might decrease, it decreases only on average. In other words, while some of the segments are detected more accurately, for some others we observe a decrease in the classification accuracy. However, this is not the case here. The chart of Figure 7.7 pictures the values in the table and shows that the error rates for all the segments consistently drop. So not only does the overall error rate decrease, but it also decreases for each cluster.

Figure 7.7 OOB error rates per segment

          Segm 1  Segm 2  Segm 3  Segm 4  Segm 5
CCEA       0.19    0.27    0.30    0.32    0.34
SS-CCEA    0.09    0.14    0.16    0.15    0.25
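The typing model and its OOB error curve can be sketched as follows; the object names lifestyle and ss_ccea are the same assumed names used above.

library(randomForest)

set.seed(2013)
# 'lifestyle' holds the predictive statements; 'ss_ccea' is the final consensus
# segment assignment used as the target (both assumed names).
typing_rf <- randomForest(x = lifestyle, y = factor(ss_ccea), ntree = 100)

# err.rate has one row per tree: the "OOB" column is the overall out-of-bag
# error, the remaining columns are per-segment error rates (cf. Figures 7.6 and 7.7).
tail(typing_rf$err.rate, 1)
plot(typing_rf$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate")
typing_rf$confusion     # per-segment confusion matrix with a class.error column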
4. Profiling & Prediction

In virtually any segmentation study, an important goal is to enable the marketing manager to profile respondents and interactively predict membership of new individuals into derived clusters. RF analysis is employed for both purposes. RF analysis provides powerful diagnostics and descriptive measures, for example: evaluating attribute importance, predicting new respondent cluster membership, and profiling cluster solutions via partial dependence plots. One profiling capability, importance measurement, is briefly described below.

In order to calculate importance for a given attribute of interest, its value is randomly permuted in each holdout (aka OOB, Out-Of-Bag) sample. Next the OOB sample data is run through each tree, and the average increase in prediction error across all T trees is calculated. This measure of average increase in predictive error serves as the attribute's importance measure. The reasoning is straightforward: if a variable is important to the prediction of the dependent measure, permuting its values should have a relatively large, negative impact on predictive performance. The charts of Figure 7.8 show the two basic measures of importance. Mean Decrease Accuracy is described above, while Mean Decrease Gini captures the decrease in node impurities resulting from splitting on the variable, also averaged over all trees. This information was used to select the most important statements to employ in the typing tool described below.

Figure 7.8 Importance of the explanatory variables for the Random Forest model (dot charts of MeanDecreaseAccuracy and MeanDecreaseGini across the lifestyle statements)

Prediction in RF analysis has been described previously as a straightforward majority vote across multiple decision trees in a Random Forest. A practical consideration, however, involves how such an algorithm may be implemented for use by the marketing manager. The above task was accomplished in a very effective manner by construction of an interactive, web browser based interface (which may be locally or remotely hosted) allowing access to an R statistical programming language object. More specifically, by utilizing the capabilities afforded in the R package "shiny" we were able to construct a powerful yet easy to use interface to a Random Forest predictive object. In addition to prediction of "relative" cluster membership, the tool also allowed for identification of attribute levels leading to maximal probability of membership in any one of the selected groups. These attribute levels are identified via application of a genetic algorithm search in which specific class membership served as the optimization target. A screenshot of the predictive tool is shown in Figure 7.9.

Figure 7.9 Segment prediction tool

VIII. SUMMARY

In this paper we extend the ensemble methodology in an intelligent and practical way to improve the consensus solution. We employ Sawtooth Software's CCEA ("Convergent" Cluster Ensemble Analysis) so as to both improve consensus partition stability and facilitate ease of estimation. The set of ensemble partitions is augmented with partitions derived from the supervised learning analysis, Random Forests. Supervised learning partitions are created from a similarity matrix based on Random Forest decision trees. These partitions are particularly useful in that they incorporate profiling information directly indicative of analyst-chosen target measures (e.g., purchaser vs. non-purchaser). We compare/contrast Semi-Supervised Convergent Cluster Ensembles Analysis (SS-CCEA) with alternate solutions based on both cluster profiles and out-of-sample post hoc prediction. Post hoc prediction of cluster membership improves unilaterally across clusters while cluster profiles show minimal changes. Finally, we implement an interactive, browser based cluster simulation tool using the R "shiny" package.
The tool enables direct access to the Random Forest object, which in turn produces superior predictive results. The tool may be hosted either locally or remotely, allowing for greater flexibility in deployment. Our approach may be referred to as "semi-supervised learning via cluster ensembles" or, in this case, "Semi-Supervised Convergent Cluster Ensembles Analysis" (SS-CCEA). Importantly, this paper highlights the ease of performing the analysis through the use of Sawtooth Software's CCEA package. CCEA both empowers the practitioner to efficiently perform ensemble analysis and allows for simple augmentation of the ensemble with externally derived partitions (in this case produced via supervised learning, RF, results).

Ewa Nowakowska     Joseph Retzer

IX. REFERENCES

L. Breiman, Bagging predictors, Machine Learning, 24: 123–140, 1996.

L. Breiman, Random forests, Machine Learning, 45(1): 5–32, 2001.

A. Demiriz, K. Bennett and M. Embrechts, A Genetic Algorithm Approach for Semi-Supervised Clustering, Journal of Smart Engineering System Design, 4: 35–44, 2002.

C. Diener and U. Jones, Having Your Cake and Eating It Too? Approaches for Attitudinally Insightful and Targetable Segmentations, Sawtooth Software Conference Proceedings, 2009.

J. Friedman and J. Meulman, Clustering Objects on Subsets of Attributes, Journal of the Royal Statistical Society: Series B, 66(4): 815–849, 2004.

J. Retzer and M. Shan, Cluster Ensemble Analysis and Graphical Depiction of Cluster Partitions, Proceedings of the 2007 Sawtooth Software Conference, 2007.

A. Strehl and J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research, 3: 583–617, 2002.

THE SHAPLEY VALUE IN MARKETING RESEARCH: 15 YEARS AND COUNTING

MICHAEL CONKLIN
STAN LIPOVETSKY
GFK

We review the application of the Shapley Value to marketing research over the past 15 years. We attempt to provide a comprehensive understanding of how it can give insight to customers. We outline assumptions underlying the interpretations so that attendees will be better equipped to answer objections to the application of the Shapley Value as an insight tool.

Imagine it is 1998. My colleague Stan Lipovetsky is working on a TURF analysis (Total Unduplicated Reach and Frequency) for product line optimization. Stan, being new to marketing research, asked the obvious question—"What are we trying to do with the TURF analysis?" TURF1 is a technique that was first used in the media business to understand which magazines to place an advertisement in. The goal was to find a set of magazines that would maximize the number of people who would see your ad (unduplicated reach) as well as maximizing the frequency of exposure among those who were reached. This was adapted for marketing research for use in product line optimization. Here, the idea was to find a set of products to offer in the marketplace such that you would maximize the number of people who would buy at least one of those products. The general procedure at the time was to ask consumers to give a purchase interest scale response for each potential flavor in a product line. Then the TURF algorithm is run to find the pair of flavors that maximizes reach (the number of people who will definitely buy at least one product of the two), the triplet that maximizes reach, the quad that maximizes reach and so on.
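To make the procedure concrete: the reach of a candidate set is simply the share of respondents who would definitely buy at least one product in it. A minimal base-R sketch that searches all sets of a given size is shown below; the 0/1 matrix name buy is an assumption for illustration.

# Exhaustive TURF search: reach of a set = share of respondents who would
# definitely buy at least one product in the set.
# 'buy' is an assumed respondents x products 0/1 matrix of top-box purchase interest.
best_turf_set <- function(buy, size) {
  sets  <- combn(ncol(buy), size, simplify = FALSE)
  reach <- sapply(sets, function(s) mean(rowSums(buy[, s, drop = FALSE]) > 0))
  list(products = sets[[which.max(reach)]], reach = max(reach))
}

best_turf_set(buy, 2)   # best pair
best_turf_set(buy, 3)   # best triplet; the number of candidate sets grows combinatorially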
TURF itself is an NP-hard problem. To be sure you have found the set of n products that maximizes reach, you must calculate the reach for all possible sets of n. Stan looked at the calculations we were doing for the TURF analysis and said, "This reminds me of something I know from game theory, the Shapley Value." "So, what is the Shapley Value?" I asked. And so began a 15-year odyssey into the realm of game theory and a single tool that has turned out to be very useful in a variety of situations.

THE SHAPLEY VALUE

Shapley first described the Shapley Value in his seminal paper in 1953.2 The Shapley Value applies to cooperative games, where players can band together to form coalitions, and each coalition creates a value by playing the game. The Shapley Value allocates that total value of the game to each player. By evaluating over all possible coalitions that a player can join in, a value for each specific player can be derived.

1 Wikipedia
2 (Shapley, 1953)

Formally, the Shapley Value for player i is defined as:

\varphi_i = \sum_{S \,\ni\, i} n(s)\,\bigl[v(S) - v(S \setminus \{i\})\bigr]

where

n(s) = \frac{(s-1)!\,(n-s)!}{n!}

So, summing across all possible subsets of players S that contain player i, the value of player i is the value of the game for a subset containing player i minus the value of that same subset of players without player i. In other words, it is the marginal value of adding the player to any possible set of other players. The summation is weighted by the factor n(s), which reflects the number of subsets of a particular size (s) that are possible given the total number of players (n).

When we apply the concept to the TURF game we have a situation where we create all possible sets of products, and calculate the "value" of each set by determining its "reach," or the percent of consumers in the study who would buy at least one item in the set. By applying the Shapley Value calculation to this data we can allocate the overall reach of all of the items to the individual items. This gives us a relative "value" of each individual product. The values of these products add up to the total value of the game, or the reach, of all of the products.

The fact that we can apply this calculation to the TURF game doesn't necessarily mean that it is useful. And, it certainly appears that the Shapley Value is an NP-hard problem as well. We need to calculate the overall reach or value of every possible subset of products to even calculate the Shapley Value for each product. Fortunately, the TURF game corresponds to what is known in game theory as a simple game. A simple game has a number of properties. In a simple game, the value of a game is either a 1 or a 0. All players in a coalition or team that produce a 1 value have a Shapley Value of 1/r where r is the number of players in the team that can produce a win. In the TURF context, a consumer is reached by a subset of products. Those products all get a Shapley Value of 1/r where r is the number of products that are in that subset. All other products get a Shapley Value of 0. Another property of simple games is that they can be combined. In our TURF data, we treat each consumer as being a simple game. To combine the simple games represented by the consumers in our study, we calculate the Shapley Value for each product for each consumer and then average across consumers.

We solve the problem of how to calculate the Shapley Value for TURF problems by considering the TURF game as a simple game. But we still are not sure what this "value" represents. For this we need to look at the problem from a marketing perspective.
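The simple-game shortcut just described takes only a few lines of R; the respondents-by-products indicator matrix relevant is an assumed name.

# Shapley Values for the TURF "simple game": within each respondent, every
# product in the relevant set gets 1/r (r = size of the relevant set) and all
# other products get 0; the per-respondent games are then averaged.
# 'relevant' is an assumed respondents x products 0/1 matrix of relevant sets.
shapley_turf <- function(relevant) {
  r <- rowSums(relevant)
  per_resp <- relevant / ifelse(r == 0, 1, r)   # divide each row by its own r
  colMeans(per_resp)
}

sv <- shapley_turf(relevant)
sum(sv)                                    # equals the overall reach of all products
all.equal(sum(sv), mean(rowSums(relevant) > 0))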
A SIMPLE MODEL OF CONSUMER BEHAVIOR Consider this simple model of consumer behavior: 1. A consumer plans to buy in the category and enters the store. 2. She reviews the products available and identifies a small subset (relevant set) that have the possibility of meeting her needs. 3. She randomly chooses a product from that subset. 268 Now clearly most of us are not explicitly using some random number generator in our heads to choose which product to buy when we visit the store. Instead we evaluate the products available and choose the one that maximizes our personal utility, that is, we choose the product we prefer . . . at that moment. The product that will maximize our utility depends upon several factors. One factor is the benefits that the particular product delivers. A second factor is the benefits delivered by other competing products that are available in the store. Benefits delivered are evaluated in the context of needs. If one has no need for a benefit then its utility is nonexistent. If one has a great need for a particular benefit then a product delivering that benefit will have a high utility and a good chance of being the utility maximizing choice. When we observe consumer purchases, for example by looking at data from a purchase panel, one can see that the specific products available, and their benefits, stay relatively constant, but nonetheless, consumers seem to buy different products on different trips to the store. This would seem to indicate that the driver of choice is the degree to which a person’s needs change from trip to trip. Hypothetically, we can map an individual’s needs to specific products that maximize utility when that need is present. This means that if we can observe the different products that a person purchases over some time period, then we can infer that those purchases are a result of the distribution of need states that exist for that consumer. If the distribution of need states for a specific consumer were such that the probability of choosing each product in the relevant set was equal then the purchase shares of each product would be the equivalent of the Shapley Value of each product. Therefore, we can think of the Shapley Value calculation as a simple choice model, where the probability of choosing a particular product is 0 for all products not in the relevant set and 1/r for all r products in the relevant set. An alternative to the Shapley Value calculation would be to estimate the specific probabilities of choosing each product using a multinomial logit discrete choice model. If, we can estimate the probabilities of purchase for each product for each consumer, then this should be a superior estimate of purchase shares since the probabilities estimated in this manner would not be arbitrarily equal for relevant products and would not be uniformly zero for non-relevant products. But, is it feasible, in the context of a consumer interview, to obtain enough choice data to accurately estimate those probabilities of purchase, especially if the product space is large? In addition, it is not possible in the course of a 20-minute interview to ask consumers to realistically make choices across multiple need states. APPLICATION OF THE SHAPLEY VALUE TO CONSUMER BEHAVIOR If we weight the consumers in our study by the relative frequency of category purchase and units per purchase occasion then the Shapley Value becomes directly a measure of share of units purchased. 
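Continuing the earlier sketch, the weighting can be folded into the averaging step; freq and units are assumed per-respondent vectors from the survey, and the result is only as good as the stated-frequency data behind it.

# Weighted Shapley Values: weight each respondent by category purchase
# frequency x units per occasion so the result approximates share of units purchased.
shapley_turf_weighted <- function(relevant, freq, units) {
  w <- freq * units
  r <- rowSums(relevant)
  per_resp <- relevant / ifelse(r == 0, 1, r)
  colSums(per_resp * w) / sum(w)    # weighted average of the per-respondent values
}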
This moves the Shapley Value from being an interesting research technique to being a very useful business management tool. Anecdotally, we understand that category managers at retailers obtain a ranked sales report for their category and consider the items that make up the bottom 20% of volume to be candidates for delisting or being replaced in the store. Since the Shapley Value provides an estimate of the sales rate for each product (in any combination), we can create a more viable recommendation for a product line. Instead of choosing products that maximize “reach,” we can 269 use a dual rule of maximizing reach subject to the restriction that no products in the line fall into the bottom 20% of volume overall. To effectively do this analysis, one needs to collect data a little differently from TURF. In a typical TURF study one asks respondents to give some purchase interest measure to each of the prospective products that would go in the product line. A consumer is counted as “reached” if she provides a top-box response to the purchase interest question. The problem with this approach is two-fold. First, the questioning procedure is very tedious, especially as the number of products in your product line increases. For that very reason, competitive brands are not typically included. But, competitive brands are critical. Those are the products you want to replace on the retailer’s shelves. The Shapley Value analysis can show you which of the competitor’s products your proposed line should displace, but it can only do so if you have included the competitive products in your study. Our suggestion is to ask respondents which products, from the category, they have purchased in some limited time period. (The time period should be dependent on the general category frequency of purchase). This data can be used to calculate Shapley Values and optimize a product line if all we are considering are existing products in the marketplace. When considering new product concepts the problem is how to reliably determine if a new product would become part of a consumer’s relevant set. This is especially problematic since consumers are well known to overstate their interest in new product concepts. A method we have found effective is to ask the typical purchase intent question for the new product and supplement it by asking consumers to rank order the new concept amongst the other products they currently buy (i.e., the ones selected in the previous task). We count a new product as entering an individual consumer’s relevant set if, and only if, they rated it top box in purchase intent and they ranked it ahead of all currently bought products. In our experience, this procedure appears to produce reasonable estimates from the Shapley Value. (Since there is no actual sales data in these cases a true validation has not been possible). GOING BEYOND TURF—OTHER APPLICATIONS OF THE SHAPLEY VALUE Recall that the Shapley Value is a way of allocating the total value of a game to the participants in a fair manner. There are plenty of situations where we only know the total value of something but we want to understand how that value can be allocated to the components that create that value. One clear example is linear regression analysis. Here we want to understand the value that each predictor has in producing the overall value of the model. The overall value of the model is usually measured by the R2 value. Frequently we wish to allocate that overall R2 value to the predictors to determine their relative importance. 
In 2000, my colleague Stan was working with one method of evaluating the importance of predictors, the net effects. Net effects are a decomposition of the R2 defined as

$$NE_i = \beta_i \,(R\beta)_i = \beta_i \sum_{j} r_{ij}\,\beta_j$$

where the betas are vectors of standardized regression coefficients and R is the correlation matrix of the predictor variables. The NE vector, when summed, equals the R2 of the model. This particular decomposition of R2 is problematic when there is a high degree of multicollinearity amongst the predictors. In those cases there can often be a sign reversal in the beta coefficients, which can cause the net effect for that predictor to be negative. This makes the interpretation of the net effects as an allocation of the total predictive power of the model illogical.

My experience with the Shapley Value caused me to wonder if the Shapley Value might be a solution to this problem. The Shapley Value is an allocation of a total value. The individual Shapley Values will therefore sum to that total value, and they will all be positive. We can easily (although less easily than in the line optimization case) calculate the incremental value of each predictor across all combinations of predictors. In the Shapley Value equation we substitute the R2 of each model for the value term:

$$SV_i = \sum_{S \subseteq N \setminus \{i\}} \gamma_n(s)\left[R^2_{S \cup \{i\}} - R^2_{S}\right], \qquad \gamma_n(s) = \frac{s!\,(n-s-1)!}{n!}$$

where the sum runs over all subsets S of the other predictors. This is no longer a simple game in the parlance of game theory, so it becomes an NP-hard problem again. But for sets of predictors smaller than 30 it is a reasonable calculation on modern computers.

I was convinced that this was an excellent idea. As is often the case with excellent ideas, it turned out that there were many others doing research in other fields who had also come up with essentially the same idea (Kruskal, 1987; Budescu, 1993; Lindeman, Merenda, & Gold, 1980). Many other related techniques also appear in the literature. We did, however, take the approach one step further. Going back to the net effects decomposition discussed earlier, we realized that both of these techniques, Net Effects and Shapley Value, were trying to do the same thing: allocate the overall model R2 to the individual predictors. So, if we assume that the Shapley Values are approximations of the Net Effects, then we can "reverse" the decomposition and calculate new beta coefficients so that they are as consistent as possible with the Shapley Values. This requires a non-linear solver, but we can estimate a new set of beta coefficients that result in Net Effects that are very close to the Shapley Values. These new coefficients can then be used in a predictive model. Gromping and Landau (2009) have criticized this approach. We show in a rejoinder (Lipovetsky & Conklin, 2010) that in conditions of high multicollinearity, the model with the adjusted beta coefficients as described above does a better job of predicting new data than the standard OLS model. We do recommend utilizing the adjusted coefficients only in those extreme conditions.
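As a concrete illustration of the Shapley Value allocation of R2 described above, the following Python sketch enumerates every predictor subset and applies the formula directly. It is a small-n illustration only (exact enumeration grows as 2 to the power n), not the authors' production implementation.

```python
import numpy as np
from itertools import combinations
from math import factorial

def r2(X, y, cols):
    """R-squared of an OLS regression of y on the predictor columns in `cols`."""
    if not cols:
        return 0.0
    Z = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - resid.var() / y.var()

def shapley_r2(X, y):
    """Allocate the full-model R-squared across predictors by exact Shapley enumeration."""
    n = X.shape[1]
    sv = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for s in range(n):                          # size of the subset S of other predictors
            gamma = factorial(s) * factorial(n - s - 1) / factorial(n)
            for S in combinations(others, s):
                sv[i] += gamma * (r2(X, y, S + (i,)) - r2(X, y, S))
    return sv                                       # sums (up to rounding) to the full-model R-squared

# Toy example with deliberately correlated predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]             # induce multicollinearity
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=200)
print(shapley_r2(X, y), r2(X, y, range(3)))
```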
Of course, there are other decompositions of R2 in the literature besides the Net Effects decomposition. One decomposition, first described by Gibson (1962) and later rediscovered by Johnson (1966), decomposes the R2 as

$$R^2 = \omega'\omega = \sum_{j} \omega_j^2$$

This decomposition produces two identical vectors of weights ω that, when squared, sum to the R2 of the model. These can be interpreted as importance weights and are very close approximations to the Shapley Values. The advantage of using this approximation of the Shapley Values for importance is that this particular decomposition is not an NP-hard problem like the Shapley Value calculation, and it is therefore much easier to compute with large numbers of predictors.

MOVING ON FROM LINEAR REGRESSION—OTHER ALLOCATION PROBLEMS

One of the nice things about the Shapley Value is that the "value function" is abstract. You can define value in any way that you want, turn the Shapley Value crank, and output an allocation of that value to the component parts. Consider the customer satisfaction problem. The Kano theory of customer satisfaction (Kano, Seraku, Takahashi, & Tsuji, 1984) suggests that different product benefits have different types of relationships to overall satisfaction.

(Kano model diagram. Graphic by David Brown—Wikipedia.)

Identifying attributes that are "basic needs" or "must-be" attributes is critical in customer satisfaction research. These are the items that cause overall dissatisfaction if, and only if, you fail to deliver. The interesting thing about these attributes is that they are non-compensatory; that is, if you fail to deliver on any one of these attributes you will have overall dissatisfaction, no matter how well you perform on other attributes.

Standard linear regression driver model approaches clearly don't work here. There are two issues: first, a linear regression model is inherently compensatory, and second, the vast majority of the data is located in the upper right quadrant of the graph above. As a result, we construct a model like this. First, let $D_A, D_B, \ldots, D_K$ represent the sets of customers dissatisfied with items A, B, C, . . . , K respectively. Also let $D_O$ represent the customers dissatisfied overall. We want to find a set of items such that

$$D_A \cup D_B \cup D_C \subseteq D_O$$

In other words, dissatisfaction with A or B or C implies dissatisfaction overall. One way of evaluating this is by calculating the reach into $D_O$. In other words, the percent of dissatisfied people that are dissatisfied with any item in the set. But this cannot be the end of the calculation, because we need to subtract from this the percent of people who are satisfied overall but are dissatisfied with one of the items in the set. In other words, we need to subtract the false positive rate. This statistic is known as Youden's J (Youden, 1950), and we can use it to evaluate any dissatisfaction model of the form noted above. In our case, we treat Youden's J statistic as the "value" of the set of items. We can search for the set of items that maximizes Youden's J and then use the Shapley Value calculation to allocate that value to the individual items (Conklin, Powaga, & Lipovetsky, 2004). This provides a priority for improvement.
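A minimal sketch of the value function just described, assuming dissatisfaction is recorded as boolean flags (the array layout and data are hypothetical); the set search and the Shapley allocation would be layered on top of this.

```python
import numpy as np

def youdens_j(dissat_items, dissat_overall):
    """Youden's J for a candidate item set, used as the 'value' of that set.

    dissat_items:   respondents x items boolean array, True = dissatisfied with that item
    dissat_overall: boolean array, True = dissatisfied overall
    J = reach into the overall-dissatisfied group minus the false positive rate.
    """
    flagged = dissat_items.any(axis=1)             # dissatisfied with any item in the set
    reach = flagged[dissat_overall].mean()         # share of the overall-dissatisfied who are flagged
    false_positive = flagged[~dissat_overall].mean()
    return reach - false_positive

# Hypothetical data: 6 respondents, a 2-item candidate set
items = np.array([[1, 0], [1, 1], [0, 0], [0, 1], [0, 0], [0, 0]], dtype=bool)
overall = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
print(youdens_j(items, overall))   # 2/3 - 1/3 = 0.333...
```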
SUMMARY

Since we started using the Shapley Value in marketing research problems a decade and a half ago, we have found it to be a very useful technique whenever we need to allocate a total value to component parts. In the case of line optimization it immediately generalizes to a reasonable model of consumer behavior, making it an extremely useful business management tool. Other applications have also proved to be quite useful. Business management, after all, seems to be primarily about prioritization, and the Shapley Value procedure provides a convenient way to prioritize the components of many business decisions when direct measures of the value of those components are not available.

Michael Conklin

REFERENCES

Budescu, D. (1993). Dominance analysis: A new approach to the problem of relative importance in multiple regression. Psychological Bulletin, 114: 542–551.
Conklin, M., Powaga, K., & Lipovetsky, S. (2004). Customer satisfaction analysis: Identification of key drivers. European Journal of Operational Research, 154: 819–827.
Gibson, W. A. (1962). On the least-squares orthogonalization of an oblique transformation. Psychometrika, 11: 32–34.
Gromping, U., & Landau, S. (2009). Do not adjust coefficients in Shapley value regression. Applied Stochastic Models in Business and Industry.
Johnson, R. M. (1966). The minimal transformation to orthonormality. Psychometrika, 61–66.
Kano, N., Seraku, N., Takahashi, F., & Tsuji, S. (1984). Attractive quality and must-be quality. Journal of the Japanese Society for Quality Control, 39–48.
Kruskal, W. (1987). Relative importance by averaging over orderings. The American Statistician, 41: 6–10.
Lindeman, R. H., Merenda, P. F., & Gold, R. Z. (1980). Introduction to Bivariate and Multivariate Analysis. Glenview, IL: Scott, Foresman.
Lipovetsky, S., & Conklin, M. (2010). Reply to the paper "Do not adjust coefficients in Shapley value regression." Applied Stochastic Models in Business and Industry, 26: 203–204.
Shapley, L. S. (1953). A value for n-person games. In H. W. Kuhn & A. W. Tucker (Eds.), Contributions to the Theory of Games, Vol. II (pp. 307–317). Princeton, NJ: Princeton University Press.
Youden, W. (1950). Index for rating diagnostic tests. Cancer, 3: 32–35.

DEMONSTRATING THE NEED AND VALUE FOR A MULTI-OBJECTIVE PRODUCT SEARCH

SCOTT FERGUSON ([email protected])
GARRETT FOSTER
NORTH CAROLINA STATE UNIVERSITY

ABSTRACT

The product search algorithms currently available in Sawtooth Software's Advanced Simulation Module focus on optimizing product line configurations for a single objective. This paper demonstrates how multi-objective product search formulations can significantly influence and help form a design strategy. Limitations of using a weighted sum approach for multi-objective optimization are highlighted, and the foundational theory behind a popular multi-objective genetic algorithm is described. Advantages of using a multi-objective optimization algorithm are shown to be richer solution sets and the ability to comprehensively explore tradeoffs between numerous criteria. Opportunities for enforcing commonality are identified, and the advantage of retaining dominated designs to accommodate un-modeled problem aspects is demonstrated. It is also shown how linking visualization and optimization tools can permit the targeting of specific regions of interest in the solution space and how a more complete understanding of the necessary tradeoffs can be achieved.

1. INTRODUCTION

Suppose a manufacturer is interested in launching a new product. Richer product design problems driven by estimates of heterogeneous customer preference are possible because of advancements in marketing research and increased computational capabilities. However, when heterogeneous preference estimates are considered, a single ideal product for an entire market is not possible. Rather, a manufacturer must offer a product line to meet the diversity of the market. Initial steps toward launching a product line might involve a manufacturer identifying its manufacturing capabilities, contacting possible suppliers, determining likely cost structures, and benchmarking the market competition. To understand how potential customers might respond to different product offerings, a choice-based conjoint study can then be fielded (using SSI Web [1], for example) to survey thousands of respondents.
Part-worths for the different product attribute levels are then estimated (using Sawtooth Software's CBC/HB module [2], for example) and the task of determining the configuration of each product begins. Though armed with a wealth of knowledge about the market, the manufacturer may still be unsure of exactly which attribute combinations will create the best product line. Rather than assessing random product configurations, the manufacturer decides to use optimization to search for the best configuration. The standard form of an optimization problem statement is shown in Equation 1, where $F(X)$ represents the objective function to be minimized, $X$ is the vector of design variables that defines the configuration of each product, $g(X)$ represents possible inequality constraints, $h(X)$ represents the equality constraints, and the final expression describes lower and upper bounds placed on each of the n design variables $x_i$.

$$\begin{aligned}
\text{minimize}\quad & F(X)\\
\text{subject to}\quad & g_j(X) \le 0, \quad j = 1, \ldots, J\\
& h_k(X) = 0, \quad k = 1, \ldots, K\\
& x_{i,\mathrm{lower}} \le x_i \le x_{i,\mathrm{upper}}, \quad i = 1, \ldots, n
\end{aligned} \tag{1}$$

1.1 Setting up a single objective product search

Sawtooth Software's Advanced Simulation Module (ASM) [3] offers product search capabilities as part of SMRT. Information needed to conduct the product search includes:

- attribute levels to be considered for each product (the design variables)
- estimates of respondent part-worths
- attribute cost information
- the number of products to search for
- competing products/the "none" option
- size of the market

To illustrate the limited information gained from a single objective optimization, consider the hypothetical design scenario of an MP3 player product line. As previously stated, one of the first steps associated with product line design is identifying the product attributes (and levels) to be considered. For this example, product attributes are shown in Table 1, and the cost of each attribute level is shown in Table 2. To solve this configuration problem, four products are to be designed, respondent part-worths are estimated using Sawtooth Software's CBC/HB module, and the "None" option is the only competition considered. Overall product price is calculated by multiplying attribute cost by 1.5 and adding a constant base price of $52. The market size for this simulation is 10,000 people.

Table 1. MP3 Player product attributes considered

Level | Photo/Video/Camera (X1) | Web/App/Ped (X2) | Input (X3) | Screen Size (X4) | Storage (X5) | Background Color (X6) | Background Overlay (X7) | Price (XP)
1 | None | None | Dial | 1.5 in diag | 2 GB | Black | No pattern / graphic overlay | $49
2 | Photo only | Web only | Touchpad | 2.5 in diag | 16 GB | White | Custom pattern overlay | $99
3 | Video only | App only | Touchscreen | 3.5 in diag | 32 GB | Silver | Custom graphic overlay | $199
4 | Photo and Video only | Ped only | Buttons | 4.5 in diag | 64 GB | Red | Custom pattern and graphic overlay | $299
5 | Photo and Lo-res camera | Web and App only | — | 5.5 in diag | 160 GB | Orange | — | $399
6 | Photo and Hi-res camera | App and Ped only | — | 6.5 in diag | 240 GB | Green | — | $499
7 | Photo, Video and Lo-res camera | Web and Ped only | — | — | 500 GB | Blue | — | $599
8 | Photo, Video and Hi-res camera | Web, App, and Ped | — | — | 750 GB | Custom | — | $699

Table 2. MP3 Player product attribute cost

Level Photo/Video/Camera 1 2 Screen Size Storage Background Color Background Overlay $0.00 $0.00 $0.00 $0.00 $12.50 $22.50 $30.00 $35.00 $22.50 $60.00 $100.00 $125.00 $5.00 $5.00 $5.00 $5.00 $2.50 $5.00 $7.50 $150.00 $5.00 $175.00 $200.00 $5.00 $10.00 Web/App/Ped Input $0.00 $0.00 $0.00 3 4 5 $2.50 $5.00 $7.50 $8.50 $10.00 $10.00 $5.00 $20.00 $2.50 $20.00 $10.00 6 $15.00 $15.00 $40.00 7 8 $16.00 $21.00 $15.00 $25.00
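The pricing rule just stated can be written down directly. The short Python sketch below computes a product's simulated price and, under the assumption that contribution margin is price minus total attribute cost, its per-unit margin; the cost vector is purely illustrative, since the column layout of Table 2 did not survive transcription cleanly.

```python
def product_price(attribute_costs, markup=1.5, base_price=52.0):
    """Simulation price rule from the text: 1.5 x total attribute cost + $52."""
    return markup * sum(attribute_costs) + base_price

def unit_margin(attribute_costs, markup=1.5, base_price=52.0):
    """Per-unit contribution margin, assumed here to be price minus attribute cost."""
    return product_price(attribute_costs, markup, base_price) - sum(attribute_costs)

# Illustrative only: costs for one 7-attribute configuration
costs = [21.00, 15.00, 2.50, 10.00, 125.00, 5.00, 10.00]
print(product_price(costs), unit_margin(costs))   # 334.75 146.25
```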
The final step in setting up the optimization problem is defining the objective function. When using ASM, four options are available: 1) product share of preference, 2) revenue, 3) profit, and 4) cost. Choosing from among these four options, however, is not an easy task. In addition to preference heterogeneity, manufacturers must also respond to the challenges of conflicting business goals. For example, market share of preference provides no insight into profitability. A share increase could simply be achieved by lowering product prices. Conversely, it may be possible to increase profits by increasing product prices. While more money would be made per sale, this price increase will likely have a negative impact on share of preference.

1.2 Illustrating the limitation of a single objective search

Suppose a manufacturer initially chooses profit as the objective to maximize. Sorted by price, the product configurations returned as the optimal solution are shown in Table 3. For this product line offering, share of preference is 82.16% and profit is $1.355 million. Not wanting to make a decision without sampling different regions of the solution space, the manufacturer then optimizes share of preference. The results from this optimization are shown in Table 4, where share of preference is 96.33% and profit is $1.070 million.

Table 3. Optimal product line when maximizing profit (share of preference: 82.16%; profit: $1.355 million)

Product configuration | Price
Photo, video and hi-res camera | Web and App | Dial | 1.5 in diag | 32 GB | Silver | Custom pattern and graphic overlay | $222.25
Photo, video and hi-res camera | Web and App | Touchscreen | 4.5 in diag | 160 GB | Silver | Custom graphic overlay | $391
Photo, video and hi-res camera | Web, App, and Ped | Touchscreen | 4.5 in diag | 500 GB | Black | Custom pattern and graphic overlay | $469.75
Photo, video and hi-res camera | Web, App, and Ped | Touchscreen | 6.5 in diag | 750 GB | Green | Custom pattern and graphic overlay | $529.75

Table 4. Optimal product line when maximizing share of preference (share of preference: 96.33%; profit: $1.070 million)

Product configuration | Price
Photo, video and hi-res camera | Web and App | Dial | 1.5 in diag | 16 GB | Silver | Custom pattern and graphic overlay | $166
Photo, video and hi-res camera | Web and App | Dial | 4.5 in diag | 16 GB | Silver | Custom graphic overlay | $207.25
Photo, video and hi-res camera | Web and App | Touchscreen | 4.5 in diag | 16 GB | Black | Custom pattern and graphic overlay | $241
Photo, video and hi-res camera | Web, App, and Ped | Touchscreen | 6.5 in diag | 32 GB | Custom | No pattern or graphic overlay | $316

Configuration differences in product line solutions are shown by the shaded cells. While both solutions offer one product that is under $300, the remaining products in Table 3 are more expensive than any of those in Table 4. The most significant difference comes in the Storage attribute, where the products in Table 3 have increased storage sizes that significantly drive up product price. The increased per-attribute profit coming from storage in this solution offsets the decrease in overall share. To be more competitive with the None option and capture as much share as possible, the configurations in Table 4 are less expensive. Additionally, results in both tables suggest significant opportunities for enforcing commonality, a notion that will be explored further in Section 4. An ideal solution would simultaneously maximize both market share of preference and profit. However, the results in Tables 3 and 4 verify that profit and share of preference are conflicting objectives—that is, to increase one a sacrifice must occur in the other.
Yet, as shown in Figure 1, the ability to describe the nature of this tradeoff is extremely limited. Without additional information it is impossible to make any statements about the region between these points. This tradeoff leads to a variety of questions that must be answered:

- Is the tradeoff between the objectives linear?
- What is the right balance that should be achieved between these objectives?
- Are these the only objectives that should be considered?

Figure 1. Comparing the results of the single objective optimizations

1.3 Posing a multi-objective optimization problem

In problems with multiple competing objectives the optimum is no longer a single solution. Rather, an entire set of non-dominated solutions can be found that is commonly known as the Pareto set [4]. A solution is said to be non-dominated if there are no other solutions that perform better on at least one objective and perform at least as well on the other objectives. A solution vector $X^*$ is said to be Pareto optimal if and only if there does not exist another vector $X$ for which Equations 2 and 3 hold true:

$$F_i(X) \le F_i(X^*) \quad \text{for } i = 1, \ldots, t \tag{2}$$

$$F_i(X) < F_i(X^*) \quad \text{for at least one } i, \ 1 \le i \le t \tag{3}$$

Building upon the formulation introduced in Equation 1, a multi-objective problem formulation is shown in Equation 4. In this equation, t represents the total number of objectives considered.

$$\begin{aligned}
\text{minimize}\quad & \{F_1(X), F_2(X), \ldots, F_t(X)\}\\
\text{subject to}\quad & g_j(X) \le 0, \qquad h_k(X) = 0, \qquad x_{i,\mathrm{lower}} \le x_i \le x_{i,\mathrm{upper}}
\end{aligned} \tag{4}$$

While Sawtooth Software's ASM does not currently support the simultaneous optimization of multiple objectives, the engineering community has frequently used such problem formulations to explore the tradeoffs between competing objectives [5–10]. Multidimensional visualization tools have also been created that facilitate the exploration of a large number of solutions and the ability to focus on interesting regions of the solution space [11–13]. This provides the opportunity for additional insights into the tradeoffs between objectives that can be especially helpful in the early stages of design, when a product strategy is still being formed. The goal of this paper is to provide an introduction to how the set of non-dominated solutions can be found in a multi-objective problem and to demonstrate the benefits of having this additional information.

2. FOUNDATIONAL BACKGROUND

Finding the solution to a problem with multiple competing objectives requires a set of non-dominated points to be identified. This is done by sampling possible solutions in the design space. In product line design problems, the design space is where product configurations are established.

Definition: Design space—Referring back to Equation 4, the design space is an n-dimensional space that contains all possible combinations of the design variables. Design variables can be either continuous or discrete.

For the MP3 player example defined in the previous section, it is assumed that only discrete attribute levels are considered. This creates an n-dimensional grid that must be sampled to find the most effective design. A two-dimensional view of this grid is shown in Figure 2. Each point in the design space is a unique design. Evaluating the performance of a design point across all t objectives defines a specific one-to-one mapping to the performance space.

Definition: Performance space—Quantifies the value of a combination of design variables (a design) with respect to each system objective.

As shown on the right in Figure 2, the set of non-dominated solutions in the performance space leads to the identification of the Pareto set (often called the Pareto frontier in two dimensions).

Figure 2. Representing the design and performance space in multi-objective optimization
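To make the non-domination test concrete, the short Python sketch below filters a list of objective vectors down to its non-dominated set, assuming every objective has been converted to minimization as in Equation 4. It illustrates the definition only, not the search procedures discussed in the remainder of Section 2.

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a dominates b when every objective is minimized:
    a is no worse than b on all objectives and strictly better on at least one."""
    a, b = np.asarray(a), np.asarray(b)
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(points):
    """Indices of the non-dominated (rank 1) members of a list of objective vectors."""
    return [i for i, p in enumerate(points)
            if not any(dominates(q, p) for j, q in enumerate(points) if j != i)]

# Maximization objectives can be negated, for example (-share of preference, -profit)
pts = [(-0.8216, -1.355), (-0.9633, -1.070), (-0.80, -1.00), (-0.90, -1.20)]
print(non_dominated(pts))   # [0, 1, 3]; the point (-0.80, -1.00) is dominated
```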
2.1 Locating the Pareto set using a linear combination of objectives

A simple, popular, and well-known approach to finding the design configurations associated with the Pareto set is to convert the multi-objective optimization problem into a single objective problem using a convex combination of the different objectives [14–16]. This is done using a weighted-sum approach, where the problem given in Equation 5 is solved:

$$\text{minimize} \quad \sum_{i=1}^{t} w_i F_i(X) \tag{5}$$

In this formulation, the t weights are often chosen such that $w_i \ge 0$ and $\sum_{i=1}^{t} w_i = 1$. Solving Equation 5 for a specific set of weights creates a single Pareto optimal point. In some formulations it is required that the weighting value be a strictly positive number, because a weight value of zero can lead to a weakly dominated design. To generate several points in the Pareto set, an even spread of the weights can be sampled. However, research has indicated several limitations to this strategy [17–21], including that:

- non-dominated solutions in non-convex regions of the Pareto set cannot be generated, regardless of weight granularity
- an even spread of weights does not produce an even spread of Pareto points in the performance space
- it can be difficult to know the relative importance of an objective a priori
- objectives must be normalized before weights are defined, so that the weighting parameter is not merely compensating for differences in objective function magnitude
- the weighted objective function is only a linear approximation in the performance space
- it is not computationally efficient to solve Equation 5 multiple times (once for each set of weights considered)

Difficulties of generating non-convex points

The inability to generate non-convex solutions of the Pareto set can be graphically illustrated when two objectives are considered [18]. Here, Equation 5 simplifies to:

$$\text{minimize} \quad w\,F_1(X) + (1 - w)\,F_2(X) \tag{6}$$

where w is constrained between 0 and 1 in Equation 6. An equivalent formulation is a trigonometric linear combination, as shown in Equation 7, where the scalar α varies between 0 and π/2. Since these formulations are equivalent, a non-dominated solution that can be found with some weight w in Equation 6 can also be found with a corresponding angle α in Equation 7. Also, if a non-dominated solution cannot be found using Equation 7, then it cannot be obtained using any convex combination of two objectives.

$$\text{minimize} \quad F_1(X)\cos\alpha + F_2(X)\sin\alpha, \qquad 0 \le \alpha \le \pi/2 \tag{7}$$

If the axes that define the performance space are rotated counterclockwise by an angle α, the rotated axes are given by Equation 8. From Figure 3, minimizing $\bar{F}_1$ is equivalent to translating the $\bar{F}_2$ axis parallel to itself until it intersects the non-dominated solution set. This intersection point is the Pareto point P. Solving the problem given by Equation 7 for all values of α explores different axis rotations by varying the slope of the tangent line while maintaining contact with the non-dominated frontier.

$$\bar{F}_1 = F_1\cos\alpha + F_2\sin\alpha, \qquad \bar{F}_2 = -F_1\sin\alpha + F_2\cos\alpha \tag{8}$$

Figure 3. Obtaining a Pareto point by solving the trigonometric linear combinations problem (adapted from [18])

If all of the objective functions and constraints for a multi-objective optimization problem are convex—the Hessian matrix is positive semi-definite or positive definite at all points on the set—then the weighted sum approach can locate all Pareto optimal points. However, consider the scenario presented by Figure 4. For the line segment shown, the slope of the tangent touches the Pareto frontier at two distinct locations. As the slope of the tangent is rotated to become less negative, Pareto frontier points in one of the convex regions are located. Rotations making the slope more negative identify Pareto points in the other convex region.
Since there are no rotations capable of identifying solutions in the portion of the Pareto frontier, the non-convex region would not be found. Figure 4. Illustrating the inability to locate non-convex portions of the Pareto frontier using trigonometric linear combinations (adapted from [18]) Difficulties of generating an even spread of Pareto points Even if the objective functions and constraints are convex, an even spread of weight values does not guarantee an even spread of Pareto points in the performance space. Rather, solutions will often clump in the performance space and provide very little information about the possible tradeoffs elsewhere. To demonstrate this challenge, consider the multi-objective optimization problem given in Equation 9. (9) A set of 11 Pareto points was found for this problem by varying w from 0 to 1 in even increments of 0.1. Figure 5 shows these points plotted in the performance space. In this figure, only 9 Pareto points are visible, as values of 0.8, 0.9 and 1 for w yield the same solution of (F1, F2) = (10, -4.0115). There is significant clustering for solutions obtained at small values of w (where the weighted function primarily focuses on objective F1). Further, there is a noticeable gap in the frontier that occurs between weighting values of 0.6 and 0.7. 283 Figure 5. Illustrating uneven distribution of Pareto points despite an even distribution of the weight parameter Addressing the remaining challenges Beyond locating points on the Pareto frontier, defining the weights themselves can pose a significant challenge. One issue that must be addressed before implementing a weighted sum approach is ensuring a comparable scale for the objective function values. If these values are not of the same magnitude, some of the weights may have an insignificant impact on the weighted objective function that is being minimized. Thus, all objective functions should be modified in such a way that they have similar ranges. However, as noted by Marler and Arora [21], when exploring solutions to a multi-objective problem and using weights to establish tradeoffs the objective functions should not be normalized. This requires an extra computational step, as the objectives must be first transformed for the optimization and then transformed back when presenting solutions. Further, the formulation presented in Equation 5 assumes a linear “preference” relationship between objectives. This has led researchers to explore more advanced non-linear relationships between objective functions [22]. Finally, solving for the Pareto point associated with a given weight combination requires an optimization to be conducted. If 100 points are desired for the Pareto set, 100 optimizations must be run. For computationally expensive problems, this can prove to be burdensome and challenging. In response to these issues associated with the weighted sum approach, researchers have explored modifications to existing heuristic optimization approaches that allow for more efficient and effective identification of the Pareto set. The next section discusses one extension of genetic algorithms for multi-objective optimization problems. 2.2 Multi-objective genetic algorithms (MOGAs) A number of multi-objective evolutionary algorithms have been proposed in the literature [23–30], mainly because they are capable of addressing the many limitations associated with the weighted sum approach. 
Further, the engineering community has frequently used the results from 284 multi-objective problem formulations to explore the tradeoffs between competing objectives. In this “design by shopping” paradigm [31], multidimensional visualization tools are used to explore a large number of alternative solutions and allow interesting regions of the space to be selected for further exploration. In support of this goal, multi-objective genetic algorithms provide the ability to find multiple Pareto optimal points in a single run. At its most basic level, the foundation of a genetic algorithm can be described by five basic components [32]. These components are: a genetic representation of possible problem solutions Several methods of encoding solutions exist, such as binary encoding, real number encoding, integer encoding, and data structure encoding. a means of generating an initial population A random initial population is typically created that covers as much of the design space to ensure thorough exploration. techniques for evaluating design fitness Design fitness describes the goodness of a possible solution. It is used to quantify the difference in solution performance or provide a ranking of the designs. genetic operators capable of producing new designs using previous design information Selection, crossover and mutation are the three primary genetic operators. Selection is used to determine which designs will produce offspring. Crossover is used to represent a mating process and mutation introduces random variation into the designs. parameter settings for the genetic algorithm These parameters control the overall behavior of the genetic search. Examples include population size, convergence criteria, crossover rate, and mutation rate. In accommodating problems with multiple performance objectives, it is necessary to modify the form of the fitness function used to assess a design. There are many reasons for this. First, in the absence of additional information, it is impossible to say that one Pareto point is better than another. As shown in the previous section, linear combinations of the objective functions suffer from multiple limitations. Second, in the presence of constraints a common optimization procedure is to apply a penalty function to signify infeasibility. However, when multiple objectives are considered, it is not clear to which objective the penalty should be applied. 285 While multiple variations of multi-objective genetic algorithms exist, this paper focuses on one of the more popular variants: NGSA-II [26]. Readers interested in a more thorough treatment of advancements in multi-objective genetic algorithms are directed to [33, 34]. NSGA-II is primarily characterized by its fast non-dominated sorting procedure, diversity preservation, and elitism, each discussed in detail below. Similar to a single objective genetic algorithm, the first step is to create an initial population of designs. The members of this initial population can either be created randomly or using targeted procedures [35–37] to improve computational efficiency and improve solution quality. Fitness of a design is defined by its non-domination level, as shown in Figure 6. Assuming minimization of performance objective, smaller values of the non-domination level correspond to better solutions, with 1 being the best. Figure 6. 
Representation of ranked points in the performance space Non-dominated sorting procedure Defining the non-domination level for a design p begins by determining the number of solutions that dominate it and the set of solutions (Sp) that p dominates. By the principle of Pareto optimality, designs with a domination level of 1 start with their domination count at 0. That is, no designs dominate them. Now, for each design in the current domination level the domination count of each solution in Sp is reduced by one. If any of these points in Sp are now non-dominated, they are placed into a separate list. This separate list represents the next nondomination level. This process continues until all fronts (non-domination levels) are identified. Diversity preservation Diversity preservation is significant for two reasons. First, it is desired that the final solution is uniformly spread across the performance space. Second, it provides a tie-breaker during selection when two designs have the same non-domination rank. In NSGA-II, the crowding distance around each design is calculated by determining the average distance of two points on either side of this design along each of the objectives. To ensure that this information can be aggregated, each objective function is normalized before calculating the crowding distance. A solution with a larger value of this measure is “more unique” than other solutions, suggesting that the space around this design should be explored further. 286 Crowding distance is then used as a secondary comparison between two designs during selection. Assume that two designs have been chosen. The first comparison between the designs is the non-domination rank. If one design has a lower non-domination rank than the other design, the design with the lower non-domination rank is chosen. When the non-domination rank is the same for both points the crowding distance measure is considered, and the design with the larger value for crowding distance is chosen. This allows designs that exist in less crowded regions of the performance space to be chosen and helps ensure a more even spread of Pareto points in the final solution. Elitism When a group of children designs are created, they are combined with the original parent population. This creates a population with a size of 2N, where N is the size of the original population. To reduce the population size back to N all designs are sorted with respect to their non-domination ranking and then with respect to their crowding distance. The top N designs are then chosen to be the parent population for the next iteration. A flowchart describing the NSGA-II algorithm is shown in Figure 7. The next section revisits the example problem originally proposed in Section 1 to explore the solution obtained when using a multi-objective genetic algorithm. Advantages of this approach are also discussed. Figure 7. Describing the NSGA-II algorithm 3. SETTING UP THE MULTI-OBJECTIVE PRODUCT LINE SEARCH This section expands upon the example presented in Section 1 by setting up the problem as a multi-objective product line optimization. As before, two objective functions will be considered. Four products are to be designed in the line, and the full multi-objective problem formulation is given by Equation 10. The first objective is maximizing the market share of preference (SOP) captured by the product line. The outer summation combines the share of preference captured by each product. 
The numerator's outermost summation combines the probability of purchase across all respondents for the current product p, before dividing by the number of respondents J. Respondent j's probability of purchase is calculated by dividing the exponential of the product's observed utility ($u_{jp}$) by the sum of the exponentials of the other products, including the part-worth associated with the "None" option. The second objective is profit. Profit of a product line can be approximated using contribution margin per person in the market (i.e., per capita), or per capita contribution margin (PCCM). To combine the margins of the four products in the line, a weighting scheme must be constructed using the share of preference of each individual product ($SOP_p$). This ensures that a product with high margin and low sales does not artificially inflate the metric. PCCM can also be used to estimate the aggregate contribution margin of a product line by multiplying PCCM by the market size.

$$\begin{aligned}
\text{Maximize:}\quad & SOP = \sum_{p=1}^{P}\frac{1}{J}\sum_{j=1}^{J}\frac{e^{u_{jp}}}{\sum_{q=1}^{P} e^{u_{jq}} + e^{u_{j,\text{none}}}}, \qquad PCCM = \sum_{p=1}^{P} SOP_p \cdot CM_p\\
\text{by changing:}\quad & \text{feature content } X_{jk}\\
\text{with respect to:}\quad & \text{no identical products in the same product line}\\
& \text{lower and upper level bounds on each attribute } (X_{jk})
\end{aligned} \tag{10}$$

The optimization problem formulated in Equation 10 consists of 28 design variables—4 products with 7 attributes per product, as shown in Figure 8. To identify the non-dominated points for this problem formulation, a multi-objective genetic algorithm was run with an initial population generated using Latin hypercube sampling. A listing of the relevant MOGA parameters is given in Table 5. The MOGA used in this paper was coded in Matlab [38] and was an extension of the foundational theory presented in Section 2.2. Figure 9 depicts the location of the frontier when the stopping criterion of 600 generations was achieved.

Figure 8. Illustration of design string

Table 5. Input parameters for the MOGA

Criteria | Setting
Initial population size | 280 (10 times the number of design variables)
Offspring created within a generation | 280 (equal to original population size)
Selection | Tournament (4 candidates)
Crossover type | Scattered
Crossover rate | 0.5
Mutation type | Adaptive
Mutation rate | 5% per bit
Stop after | 600 generations

Figure 9. Set of non-dominated solutions after 600 generations

The plot shown in Figure 9 illustrates the additional solutions that are found when the product line design problem is solved using a multi-objective genetic algorithm. The next section of the paper explores the advantages—in terms of information available, insights gained, and user interaction—that are possible because of this approach.

4. USING MULTI-OBJECTIVE OPTIMIZATION RESULTS TO GUIDE MARKET-BASED DESIGN STRATEGIES

Formulating and solving a multi-objective product line design problem is the first step in defining a design strategy. This section illustrates how information from the solution can influence the choice of design strategy and problem formulation. By walking through an example of how such data might be analyzed, it is shown that product architecture insights can be gathered by considering the entire set of non-dominated solutions, and that dominated solutions can be explored to accommodate preferences associated with un-modeled objectives. If this causes the scope of the multi-objective problem to be expanded, the space can be explored using interactive multidimensional visualization tools. Visualizing the non-dominated set allows areas of interest to be identified in "real-time" and solutions populated in those areas of the space.
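As a reference point for the discussion that follows, here is a minimal Python sketch of how the two objectives in Equation 10 can be evaluated for one candidate line. The utility matrix, the margins, and the reading of PCCM as a share-weighted per-unit margin are illustrative assumptions for this sketch, not the authors' Matlab implementation.

```python
import numpy as np

def line_objectives(utilities, none_utility, margins):
    """Evaluate the two objectives of Equation 10 for one candidate product line.

    utilities:    J x P array of each respondent's total utility for each product
    none_utility: length-J array of utilities for the "None" alternative
    margins:      length-P array of per-unit contribution margins (assumed known)
    Returns (line share of preference, per-capita contribution margin).
    """
    expu = np.exp(utilities)
    denom = expu.sum(axis=1) + np.exp(none_utility)   # logit denominator, incl. None
    probs = expu / denom[:, None]                     # J x P purchase probabilities
    sop_by_product = probs.mean(axis=0)               # share of preference per product
    sop = sop_by_product.sum()
    pccm = float(sop_by_product @ margins)            # share-weighted margin per capita
    return sop, pccm

# Tiny illustrative example: 3 respondents, a 2-product line
u = np.array([[1.2, 0.4],
              [0.1, 0.9],
              [-0.5, 0.2]])
none = np.zeros(3)
sop, pccm = line_objectives(u, none, margins=np.array([90.0, 140.0]))
print(round(sop, 3), round(pccm, 2))   # PCCM x market size approximates line profit
```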
4.1 Deriving product architecture insights from the non-dominated set of solutions The increased information available in Figure 9 may make it difficult for a manufacturer to select a solution. Consider that the manufacturer selects four non-dominated points that appear interesting. As shown in Figure 10, the “Max. share” solution is chosen because it maximizes share of preference. The “Max. profit” solution is chosen because it maximizes profit. The “Profit trade” solution is chosen because it gains a significant increase in share (almost 8%) while sacrificing very little in profit. Finally, the “Share trade” solution is chosen because it gains profit with a very small decrease in market share (about 1%). Figure 10. Selecting four of the non-dominated solutions Product configurations for the four solutions are shown in Table 6. Also shown is the calculated share of preference and profit for a market size of 10,000 people. From this information, the manufacturer can then compare the properties of a solution. For instance, the “Profit trade” solution is the only product line to offer “Buttons” as an input, and it only occurs in one of the products. The share-focused solutions are configured with very small storage options compared to that of the profit-focused solutions. This supports the insight from Tables 3 and 4 that suggested smaller storage sizes are used to capture share (because they are less expensive products that can compete against the None), while larger storage options generate more overall profit (at the expense of share of preference). 290 Table 6. Product line configurations for the four chosen solutions 96.33% $1.070 million Share trade solution 95.38% $1.198 million Photo/Video/Camera Photo, video, and hi-res camera Photo, video, and hi-res camera Photo, video, and hi-res camera Photo, video, and hi-res camera Web/App/Ped Web and app only Web and app only Web and app only Web and app only Dial Dial Buttons Dial 1.5 in diag. 1.5 in diag. 1.5 in diag. 1.5 in diag. 16 GB 16 GB 16 GB 32 GB Max. share solution Product line Share of preference Profit Input Product 1 Screen Size Storage Background Color 88.42% $1.348 million 82.16% $1.355 million Silver Silver Silver Silver Custom pattern and graphic Custom graphic Custom pattern and graphic $166 $166 $177.25 $222.25 Photo/Video/Camera Photo, video, and hi-res camera Photo, video, and hi-res camera Photo, video, and hi-res camera Photo, video, and hi-res camera Web/App/Ped Web and app only Web and app only Web and app only Web and app only Dial Touchscreen Touchscreen Touchscreen 4.5 in diag. 4.5 in diag. 4.5 in diag. 4.5 in diag. Storage 16 GB 16 GB 160 GB 160 GB Background Color Silver Silver Silver Silver Custom graphic Custom graphic Custom graphic Custom graphic $207.25 $237.25 $391 $391 Photo/Video/Camera Photo, video, and hi-res camera Photo, video, and hi-res camera Web/App/Ped Web and app only Web and app only Touchscreen Touchscreen Photo, video, and hi-res camera Web, app, and ped Touchscreen Photo, video, and hi-res camera Web, app, and ped Touchscreen 4.5 in diag. 4.5 in diag. 4.5 in diag. 4.5 in diag. 16 GB 16 GB 500 GB 500 GB Black Price Input Screen Size Background Overlay Price Input Product 3 Max. 
profit solution Custom pattern and graphic Background Overlay Product 2 Profit trade solution Screen Size Storage Background Color Background Overlay Price Silver Custom Black Custom pattern and graphic Custom graphic Custom pattern and graphic Custom pattern and graphic $235.5 $237.25 $484.75 $469.75 291 Photo/Video/Camera Web/App/Ped Input Product 4 Photo, video, and hi-res camera Web, app, and ped Touchscreen Photo and video only Web, app, and ped Touchscreen Photo, video, and hi-res camera Web, app, and ped Touchscreen Photo, video, and hi-res camera Web, app, and ped Touchscreen 6.5 in diag. 4.5 in diag. 6.5 in diag. 6.5 in diag. 32 GB 64 GB 750 GB 750 GB Custom Black Green Green No pattern or graphic Custom pattern and graphic Custom pattern and graphic Custom pattern and graphic $316 $337 $529.75 $529.75 Screen Size Storage Background Color Background Overlay Price Similar to the results in Tables 3 and 4, the product configurations in Table 6 show a significant degree of commonality. To get a better understanding of the solution space, the manufacturer explores (i) how many unique product configurations exist, and (ii) how many different attribute levels are used. Figure 4 has 71 solutions, and each solution has 4 products per line, meaning that there are 276 total product configurations in the solution set. Of these 276 products, only 47 unique configurations exist. The breakdown of attribute usage in these 47 products is shown in Table 7. Table 7. Breakdown of attribute usage in the 47 unique product configurations Attribute Level Photo/Video/Camera Web/App/Ped Input 1 2 3 4 5 6 7 8 0 1 0 4 0 0 1 41 0 0 1 0 22 1 0 23 7 2 36 2 Screen Size 8 0 1 24 1 13 Storage Background Color Background Overlay 0 12 4 6 12 0 4 9 14 0 16 1 1 7 0 8 1 2 15 29 The darkest cells represent attribute levels used in over 20% of the products. Examples include the “Photo, Video, and Hi-Res Camera” option (in 41 of the 47 products) and a “Touchscreen” input (in 36 of the 47 products). The remaining shaded cells are used in at least one product, such as the “Photo only” option (in 1 of the 47 products). There are 13 product attribute levels (28.26% of the total number of attribute levels) that are never used. For product attributes where one level is primarily used, the manufacturer may want to consider product platforming possibilities [39]. By making this attribute a key element of the product line’s architecture, the manufacturer may realize cost savings that permit variety to be offered in the other attributes. For example, it appears that there are few business advantages of offering an MP3 player without the ability to play photos and videos and capture content using a hi-resolution camera. Yet, the solution set—and thereby the market—is not nearly as homogeneous when it comes to storage size. By platforming around the first attribute, solutions from the non-dominated set can be explored that have multiple storage sizes. By doing this, a manufacturer can strategically offer product variety in a way that captures different groups in the respondent market. 292 4.2 Influencing design strategy using dominated designs Suppose that after considering the trade ramifications between business goals the manufacturer has identified a favorite solution from Figure 4. As shown in Figure 6, this solution captures 93.452% share of preference and returns a profit of $1.261 million. Since this is a zoomed in version of Figure 10, the “Share trade” solution is also identified to give perspective. Figure 11. 
Improving commonality by considering dominated solutions

Now, assume that the manufacturer (after interpreting the results in Tables 3, 4 and 7) is also interested in the commonality associated with the product line solution. Commonality often has a benefit due to tooling cost and inventory savings. The commonality index (CI) was introduced by Martin and Ishii [40, 41] as a measure of unique parts, and can be calculated using Equation 11:

$$CI = 1 - \frac{u - \max_i m_i}{\sum_{i=1}^{n} m_i - \max_i m_i} \tag{11}$$

Here, u is the total number of distinct components, $m_i$ represents the number of components used in variant i, and n is the number of variants in the product line. CI ranges from 0 to 1, where a smaller value indicates more unique parts. The CI for the current solution is 0.4762.

The manufacturer likes this solution with respect to the business goals, but would also be interested in a solution with increased commonality to potentially realize greater cost savings. Recall that the non-dominated set is comprised of Rank 1 solutions (triangles). Moving to the Rank 1 solution on the right (increasing share of preference) or to the Rank 1 solution on the left (decreasing share of preference) corresponds in either case to a solution with a CI value of 0.4762, that is, no change in CI. Not wanting to deviate too far from the current solution, the manufacturer uses the MOGA data to recall the Rank 2 solutions. These solutions are dominated only by Rank 1 solutions, as discussed in Section 2.2. Exploring these Rank 2 solutions, the manufacturer finds a product line with a CI of 0.5714. This Rank 2 product line configuration, also highlighted in Figure 11, has a share of preference of 93.2945% and a profit of $1.2585 million. By keeping a rank listing of the final population from—or all points evaluated by—the MOGA, the manufacturer has a degree of design freedom with which to explore the solution space.
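To illustrate Equation 11, the sketch below computes CI when each product's seven attribute levels are treated as its "components" (an interpretive assumption for this example). Any four-product, seven-attribute line with 18 distinct attribute levels yields the 0.4762 value quoted above; the specific configuration shown is hypothetical.

```python
def commonality_index(variants):
    """Martin & Ishii commonality index (Equation 11).

    variants: one list of components per product in the line; here each
    component is an attribute-level label (an assumption made for illustration).
    """
    u = len({c for v in variants for c in v})   # distinct components across the line
    m = [len(v) for v in variants]              # components used by each variant
    return 1 - (u - max(m)) / (sum(m) - max(m))

# Hypothetical 4-product line over 7 attributes with 18 distinct levels
line = [
    ["cam8", "web5", "in1", "scr1", "sto2", "col3", "ovl4"],
    ["cam8", "web5", "in3", "scr4", "sto2", "col3", "ovl3"],
    ["cam8", "web8", "in3", "scr5", "sto5", "col1", "ovl4"],
    ["cam8", "web8", "in3", "scr6", "sto7", "col8", "ovl1"],
]
print(round(commonality_index(line), 4))  # 0.4762, matching the CI quoted in the text
```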
4.3 Using n-dimensional problem formulations to drive product strategy

Considering dominated designs is one strategy for navigating the solution space while trying to minimize the trades made between conflicting objectives. However, it is not guaranteed that the solution in Figure 11 is optimal across all three objectives (maximize share of preference, maximize profit, maximize commonality). Since multi-objective problem formulations allow a nearly "infinite" number of objectives to be considered simultaneously, the manufacturer may want to restructure the problem to consider all three objectives. Equation 12 shows the extension of the problem originally posed in Equation 10:

$$\begin{aligned}
\text{Maximize:}\quad & \{\,SOP(X),\ PCCM(X),\ CI(X)\,\}\\
\text{by changing:}\quad & \text{feature content } X_{jk}\\
\text{with respect to:}\quad & \text{no identical products in the same product line}\\
& \text{lower and upper level bounds on each attribute } (X_{jk})
\end{aligned} \tag{12}$$

Including additional objectives in the problem formulation allows a focus to be placed on various business objectives. For example, a manufacturer may want to decide on a product strategy by maximizing the penetration of a particular attribute level. A focus could also be placed on maximizing the probability of purchase for a specific demographic. Other possible examples include designing for age groups, genders, or geographical location. However, increasing the size of the optimization problem does not come without a cost. Adding objectives increases computational expense and the difficulty associated with navigating the solutions for insights. Therefore, strategic choices of problem objectives must be made to balance computational expense, the magnitude of information reported, and the variety of business goals.

Efforts to develop effective tradespace exploration tools [11–13] facilitate multidimensional visualization, filtering of unwanted solutions, and detailed exploration of interesting regions of the solution space. Tradespace exploration tools, like Penn State's ARL Trade Space Visualizer (ATSV) [11, 12], can also be linked to optimization algorithms to enable real-time user interaction and design steering. When the number of objectives or criteria considered goes beyond four or five, visualization tools become significantly less effective. Technical feasibility models have been developed that create parametric representations of the non-dominated frontier. This work has demonstrated that the feasibility of a desired set of business goals could be tested in the engineering domain and that the necessary product configurations capable of meeting these goals could be quickly determined [10, 42].

Since this problem only has three objectives, the manufacturer can use software like ATSV to visualize the non-dominated set of solutions. Figure 12 shows a three-dimensional glyph plot of the non-dominated solution set. Color has been added to this figure to show the various levels of commonality. ATSV would allow the manufacturer to explore this three-dimensional plot by rotating, zooming and panning, which would free color to be used to display information about another aspect of the problem; however, color was chosen here to show commonality due to the perspective challenges associated with the printed page. As expected, an increase in commonality sacrifices performance with respect to share of preference and profit. However, possible cost savings from increased commonality are not included in the problem formulation. This could be added with more advanced costing models. With the information provided, the manufacturer explores the available solutions and chooses the one that best reflects an acceptable trade between objectives, as shown by the highlighted design.

Figure 12. Non-dominated solutions for the three objective problem formulation

Table 8 shows the configuration of the product line, sorted by price. Commonality in this solution is larger than in the previous solutions selected by the manufacturer. The lightly shaded cells reflect configuration changes that are included in more than one product. Darker cells reflect configuration changes that are unique to a single product. In the first product, for example, the configuration change (1.5 in diagonal screen) is done to reduce cost. For the more expensive product, two unique changes exist that increase product price but provide variety not offered by the other products in the line.
Table 8. Product line chosen from the three objective problem formulation (share of preference: 96.186%; profit: $1.0705 million; commonality index: 0.6667)

Product configuration | Price
Photo, video and hi-res camera | Web and App | Dial | 1.5 in diag | 16 GB | Silver | Custom pattern and graphic overlay | $166
Photo, video and hi-res camera | Web and App | Dial | 4.5 in diag | 16 GB | Silver | Custom graphic overlay | $207.25
Photo, video and hi-res camera | Web, App, and Ped | Touchscreen | 4.5 in diag | 16 GB | Black | Custom pattern and graphic overlay | $241
Photo, video and hi-res camera | Web and App | Touchscreen | 6.5 in diag | 32 GB | Silver | Custom graphic overlay | $308.50

4.4 Targeting regions of interest in the solution space

While the non-dominated solutions in Figure 12 do provide significantly more information about the tradeoff between share of preference, profit, and commonality, there are still regions of the solution space where no tradeoff information is present. These "gaps" [42] in the non-dominated set may exist because the algorithm has not yet found points in that region of the performance space, or because no solutions are possible in that region. Exploring the results in Figure 12, the manufacturer notices that there is one such gap directly "below" the product line solution shown in Table 8. Curious as to whether product line solutions exist in that region of the space, the manufacturer can modify the problem formulation posed in Equation 12. As shown in Equation 13, adding side constraints on the values of the objective functions provides a region of attraction for the algorithm to target. The MOGA can then be re-initialized to pursue solutions that exist within these domains of attraction.

$$\begin{aligned}
\text{Maximize:}\quad & \{\,SOP(X),\ PCCM(X),\ CI(X)\,\}\\
\text{by changing:}\quad & \text{feature content } X_{jk}\\
\text{with respect to:}\quad & \text{no identical products in the same product line}\\
& \text{lower and upper level bounds on each attribute } (X_{jk})\\
& \text{lower and upper bounds on the targeted objective values}
\end{aligned} \tag{13}$$

When used in conjunction with multidimensional visualization tools, these attractors [12] can be placed in "real-time" as the algorithm progresses toward the final set of non-dominated points. This allows a manufacturer to explore regions of the space deemed interesting and directly supports the original concept of the "design by shopping" paradigm [31], where manufacturers play an active role in arriving at the final answer. To fill the gap in this frontier, the manufacturer places upper and lower bounds on share of preference (95.85% to 96.11%) and profit ($1.099 million to $1.137 million). The MOGA is re-run with these constraints in place to find additional solutions. The results of this optimization are shown in Figure 13. The glyph plot shows that the MOGA was able to find product line solutions that exist "below" the currently selected design. In this figure, the size of the boxes scales with the value of the commonality index: large boxes represent a CI value closer to 1, while smaller boxes represent a CI value closer to 0. The manufacturer now explores these new solutions to see if one of them better meets the business goals than the previously selected design configurations. Finding a solution more in line with the desired business outcomes—with increased profit and commonality at a small share of preference penalty—the manufacturer selects this new solution, detailed in Table 9.

Figure 13. Glyph plot showing additional solutions found using an attractor

Table 9.
Optimal product line when maximizing profit Share of preference: Profit: Commonality index: 95.991% $1.118 million 0.714286 Product configurations Photo,video and hi-res camera Web and App Dial 1.5 in diag 16 GB Photo,video and hi-res camera Web and App Dial 4.5 in diag Photo,video and hi-res camera Web and App Touchscreen Photo,video and hi-res camera Web, App, and Ped Touchscreen Price Silver Custom pattern and graphic overlay $166 16 GB Silver Custom graphic overlay $207.25 4.5 in diag 16 GB Black Custom pattern and graphic overlay $233.5 4.5 in diag 160 GB Black Custom graphic overlay $391 The shaded cells in Table 9 show where this selected solution differs in product configuration from the solution selected in Table 8. The first two products are identical. The third product removes the pedometer, only offering web and app access. This reduces product price by $7.50. The most significant changes come in the fourth product. Here, the pedometer is added on, the screen size is reduced (to 4.5 inches from 6.5 inches), storage size is increased (160 GB from 32 GB), and color is changed to black from silver. Finally satisfied with the product solution found after leveraging market information to evaluate the solutions and using multidimensional visualization to explore the space, the manufacturer locks down the solution and begins production. 298 5. CONCLUSIONS AND FUTURE WORK Multi-objective optimization algorithms and multidimensional visualization tools have been developed and used by the engineering design community over the last 25 years. The objective of this paper was to demonstrate how these tools and technologies could be extended to marketdriven product searches. Toward this goal, Section 1 introduced a case study problem centered around the design of an MP3 player product line. Using part-worths estimated from Sawtooth Software’s CBC/HB module, the Advanced Search Module within SMRT was used to optimize product lines around a single objective. Results from these optimizations showed that the two objectives considered—share of preference and profit—were in conflict. More importantly, no information was available about the tradeoff that existed between these objectives except for knowledge about the two solutions that comprised the “endpoints” of the solution space. Section 2 introduced the weighted sum approach—an easy and extremely common approach—to solving problems with multiple objectives. However, the many limitations of this approach were also highlighted. Significant limitations are the inability to generate solutions in the non-convex region of the non-dominated set, the inability to create solutions with an even spread in the performance space, and the computational expense associated with finding many non-dominated solutions. To address these shortcomings, the foundational background for a popular multi-objective genetic algorithm—NSGA-II—was presented in Section 2.2. This algorithm addresses many of the limitations of the weighted sum approach while simultaneously encouraging solution diversity and maintaining elitism. A solution to the multi-objective problem was introduced in Section 3 using the multiobjective genetic algorithm. This led to a discussion in Section 4 of advantages of conducting a product search using a multi-objective problem formulation. An increased number of solutions— from 2 to 71—provided enormous amounts of additional insight into the problem. 
It was shown that the 71 product line solutions contained only 47 unique products, and that there were significant opportunities for enforcing commonality or eliminating attribute levels with few negative ramifications. Retaining dominated designs from the final population was also shown to have merit when unmodeled criteria were considered: commonality within the product line could be increased by selecting a dominated design, with minimal losses to the business objectives, while staying in the desired region of the performance space. Section 4 also highlighted how exploration of the solution space could be enabled by interactive multidimensional visualization tools. This technology enables the user to direct attention to specific regions of interest in the performance space while also accommodating direct comparisons between non-dominated solutions.

Research challenges arising from this work include understanding the progression of the non-dominated solutions in the performance space as new technologies, product attributes, and competition are introduced. Further, there is a need to understand how the selected product design strategy changes over time. A manufacturer might initially adopt a product strategy designed to capture a large initial market foothold. Over time, however, the manufacturer's goals may shift toward profit while maintaining some aspect of market share. Future work could look to model and capture this information.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge support from the National Science Foundation through NSF CAREER Grant No. CMMI-1054208 and NSF Grant CMMI-0969961. Any opinions, findings, and conclusions presented in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Scott Ferguson

REFERENCES

[1] Sawtooth Software, 2008, "CBC v.6.0," Sawtooth Software, Inc., Sequim, WA, http://www.sawtoothsoftware.com/download/techpap/cbctech.pdf.
[2] Sawtooth Software, 2009, "The CBC/HB System for Hierarchical Bayes Estimation Version 5.0 Technical Paper," Sawtooth Software, Inc., Sequim, WA, http://www.sawtoothsoftware.com/download/techpap/hbtech.pdf.
[3] Sawtooth Software, 2003, "Advanced Simulation Module for Product Optimization v1.5 Technical Paper," Sequim, WA.
[4] Pareto, V., 1906, Manuale di Economia Politica, Società Editrice Libraria, Milan, Italy; translated into English by A. S. Schwier as Manual of Political Economy, Macmillan, New York, 1971.
[5] Tappeta, R. V., and Renaud, J. E., 1997, "Multiobjective Collaborative Optimization," Journal of Mechanical Design, 119(3): 403–412.
[6] Coello, C. A. C., and Christiansen, A. D., 1999, "MOSES: A Multiobjective Optimization Tool for Engineering Design," Engineering Optimization, 31(3): 337–368.
[7] Narayanan, S., and Azarm, S., 1999, "On Improving Multiobjective Genetic Algorithms for Design Optimization," Structural Optimization, 18(2–3): 146–155.
[8] Marler, R. T., and Arora, J. S., 2004, "Survey of Multi-Objective Optimization Methods for Engineering," Structural and Multidisciplinary Optimization, 26(6): 369–395.
[9] Mattson, C., and Messac, A., 2005, "Pareto Frontier Based Concept Selection Under Uncertainty, With Visualization," Optimization and Engineering, 6(1): 85–115.
[10] Gurnani, A., Ferguson, S., Donndelinger, J., and Lewis, K., 2005, "A Constraint-Based Approach to Feasibility Assessment in Conceptual Design," Artificial Intelligence for Engineering Design, Analysis and Manufacturing, Special Issue on Constraints and Design, 20(4): 351–367.
[11] Stump, G., Yukish, M., Martin, J., and Simpson, T., 2004, "The ARL Trade Space Visualizer: An Engineering Decision-Making Tool," 10th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, Albany, NY, AIAA-2004-4568.
[12] Stump, G. M., Lego, S., Yukish, M., Simpson, T. W., and Donndelinger, J. A., 2009, "Visual Steering Commands for Trade Space Exploration: User-Guided Sampling with Example," Journal of Computing and Information Science in Engineering, 9(4): 044501:1–10.
[13] Daskilewicz, M. J., and German, B. J., 2011, "Rave: A Computational Framework to Facilitate Research in Design Decision Support," Journal of Computing and Information Science in Engineering, 12(2): 021005:1–9.
[14] Eckenrode, R. T., 1965, "Weighting Multiple Criteria," Management Science, 12: 180–192.
[15] Arora, J. S., 2004, Introduction to Optimum Design—2nd edition, Academic Press.
[16] Rao, S., 2009, Engineering Optimization: Theory and Practice—4th edition, Wiley.
[17] Athan, T. W., and Papalambros, P. Y., 1996, "A Note on Weighted Criteria Methods for Compromise Solutions in Multi-objective Optimization," Engineering Optimization, 27: 155–176.
[18] Das, I., and Dennis, J. E., 1997, "A Closer Look at Drawbacks of Minimizing Weighted Sums of Objectives for Pareto Set Generation in Multicriteria Optimization Problems," Structural Optimization, 14: 63–69.
[19] Chen, W., Wiecek, M. M., and Zhang, J., 1999, "Quality Utility—A Compromise Programming Approach to Robust Design," Journal of Mechanical Design, 121: 179–187.
[20] Messac, A., and Mattson, C. A., 2002, "Generating Well-distributed Sets of Pareto Points for Engineering Design Using Physical Programming," Engineering Optimization, 3: 431–450.
[21] Marler, R. T., and Arora, J. S., 2010, "The Weighted Sum Method for Multi-objective Optimization: New Insights," Structural and Multidisciplinary Optimization, 41(6): 853–862.
[22] Kim, I. Y., and de Weck, O. L., 2005, "Adaptive Weighted-Sum Method for Bi-objective Optimization: Pareto Front Generation," Structural and Multidisciplinary Optimization, 29(2): 149–158.
[23] Fonseca, C. M., and Fleming, P. J., 1998, "Multiobjective Optimization and Multiple Constraint Handling with Evolutionary Algorithms—Part I: A Unified Formulation," IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 28(1): 38–47.
[24] Zitzler, E., and Thiele, L., 1999, "Multi-objective Evolutionary Algorithms: A Comparative Case Study and the Strength Pareto Approach," IEEE Transactions on Evolutionary Computation, 3(4): 257–271.
[25] Zitzler, E., Deb, K., and Thiele, L., 2000, "Comparison of Multi-objective Evolutionary Algorithms: Empirical Results," Evolutionary Computation, 8(2): 173–195.
[26] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., 2002, "A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, 6(2): 182–197.
[27] Coello, C. A. C., Pulido, G. T., and Lechuga, M. S., 2004, "Handling Multiple Objectives with Particle Swarm Optimization," IEEE Transactions on Evolutionary Computation, 8(3): 256–279.
[28] Zhang, Q., and Li, H., 2007, "MOEA/D: A Multi-objective Evolutionary Algorithm Based on Decomposition," IEEE Transactions on Evolutionary Computation, 11(6): 712–731.
[29] Bandyopadhyay, S., Saha, S., Maulik, U., and Deb, K., 2008, "A Simulated Annealing-based Multi-objective Optimization Algorithm: AMOSA," IEEE Transactions on Evolutionary Computation, 12(3): 269–283.
[30] Hadka, D., and Reed, P., 2013, "Borg: An Auto-Adaptive Many-Objective Evolutionary Computing Framework," Evolutionary Computation, 21(2): 231–259.
[31] Balling, R., 1999, "Design by Shopping: A New Paradigm," Proceedings of the Third World Congress of Structural and Multidisciplinary Optimization, 295–297.
[32] Michalewicz, Z., and Schoenauer, M., 1996, "Evolutionary Algorithms for Constrained Parameter Optimization Problems," Evolutionary Computation, 4(1): 1–32.
[33] Coello, C. A. C., 2006, "Evolutionary Multi-objective Optimization: A Historical View of the Field," IEEE Computational Intelligence Magazine, 1(1): 28–36.
[34] Coello, C. A. C., Lamont, G. B., and Van Veldhuizen, D. A., 2007, Evolutionary Algorithms for Solving Multi-objective Problems, Springer, NY.
[35] Turner, C., Foster, G., Ferguson, S., Donndelinger, J., and Beltramo, M., 2012, "Creating Targeted Initial Populations for Genetic Product Searches," 2012 Sawtooth Software Users Conference, Orlando, FL.
[36] Turner, C., Foster, G., Ferguson, S., and Donndelinger, J., "Creating Targeted Initial Populations for Genetic Product Searches in Heterogeneous Markets," Engineering Optimization.
[37] Foster, G., and Ferguson, S., "Enhanced Targeted Initial Populations for Multi-objective Product Line Optimization," Proceedings of the ASME 2013 International Design Engineering Technical Conference & Computers and Information in Engineering Conference, Design Automation Conference, Portland, OR, DETC2013-13303.
[38] Matlab, The MathWorks, Inc.
[39] Simpson, T. W., Siddique, Z., and Jiao, R. J., 2006, Product Platform and Product Family Design: Methods and Applications, Springer.
[40] Martin, M. V., and Ishii, K., 1996, "Design for Variety: A Methodology for Understanding the Costs of Product Proliferation," Proceedings of the 1996 ASME Design Engineering Technical Conferences, Irvine, CA, DTM-1610.
[41] Martin, M. V., and Ishii, K., 1997, "Design for Variety: Development of Complexity Indices and Design Charts," Proceedings of the 1997 ASME Design Engineering Technical Conferences, Sacramento, CA, DFM-4359.
[42] Ferguson, S., Gurnani, A., Donndelinger, J., and Lewis, K., 2005, "A Study of Convergence and Mapping in Preliminary Vehicle Design," International Journal of Vehicle Systems Modeling and Testing, 1(1/2/3): 192–215.

A SIMULATION BASED EVALUATION OF THE PROPERTIES OF ANCHORED MAXDIFF: STRENGTHS, LIMITATIONS AND RECOMMENDATIONS FOR PRACTICE

JAKE LEE
MARITZ RESEARCH

JEFFREY P. DOTSON
BRIGHAM YOUNG UNIVERSITY

INTRODUCTION

Over the past few years MaxDiff has emerged as an efficient way to elicit a rank ordering over a set of attributes. In essence, MaxDiff is a specific type of discrete choice exercise in which respondents are asked to identify, from a subset of items drawn from a list, those they feel are the most and least preferred as they relate to some decision of interest. Through repeated choices, researchers can infer the relative preference of these items, thus providing a prioritized list of actions a firm could take to improve operations.
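As background for the estimation discussions that follow, the sketch below illustrates one common coding convention (an assumption on our part, not necessarily the authors' exact setup): each MaxDiff task is expanded into two multinomial logit observations, a "best" choice over the design rows of the items shown and a "worst" choice over the same rows with their signs reversed.

```python
# A sketch of one common MaxDiff coding convention (assumed, not the authors' setup):
# one task yields a "best" MNL observation over +X rows and a "worst" observation
# over -X rows, with utilities of length J-1 (last item as the omitted reference).

import numpy as np

J = 10                     # total number of items
shown = [0, 3, 5, 9]       # items presented in one hypothetical task
best, worst = 3, 5         # hypothetical responses

def design_row(item, total_items):
    """Dummy-code an item against the last item as the omitted reference level."""
    row = np.zeros(total_items - 1)
    if item < total_items - 1:
        row[item] = 1.0
    return row

X_shown = np.vstack([design_row(i, J) for i in shown])

X_best, y_best = X_shown, shown.index(best)      # ordinary MNL choice of the best item
X_worst, y_worst = -X_shown, shown.index(worst)  # sign-flipped rows; "chosen" = worst item

print(X_best, y_best, X_worst, y_worst, sep="\n")
```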
A well-known limitation of a standard MaxDiff study is that it does not allow for identification of a preference threshold, or the point that differentiates important from unimportant items. MaxDiff allows us to infer the relative preference of the items tested in the study, but cannot determine which of these items the respondent believes the company should actually change (i.e., which items would have a meaningful impact on their choice behavior). It is possible that respondents with vastly different preference thresholds can manifest similar response patterns, thus resulting in similar utility scores across the set of items. Without a preference threshold, a respondent who thinks all options are in need of improvement could be indistinguishable from a respondent that believes that nothing is in need of improvement. Anchored MaxDiff has emerged as a solution to this particular problem. Three anchored MaxDiff techniques have been proposed to resolve this issue through the introduction of a respondent-specific threshold (or anchor) within or in conjunction with the MaxDiff exercise. These approaches are the Indirect (Dual Response) method (Louviere, Orme), the Direct Approach (Lattery), and the Status-Quo Approach (Chrzan and Lee). Both the Indirect and Direct approaches have been examined in prior Sawtooth Software Conference Proceedings and white papers. While the Status-Quo approach has not been formally studied, it has been mentioned in the same sources. In this paper we examine both the theoretical and empirical properties of each of the proposed anchored MaxDiff techniques. This is accomplished through the use of a series of simulation studies. By simulating data from a process where we know the truth, we can contrast the ability of each approach to recover the true rank order of items and their relation to the preference threshold. Through this research we hope to identify when and under what circumstances each approach is likely to prove most effective, thus allowing us to provide practical advice to the practitioner community. 305 THEORETICAL FOUNDATIONS Our approach to studying the properties of the various anchored MaxDiff approaches is built upon the idea that all discrete outcomes can be characterized as the (truncated) realization of an unobserved continuous process. In the case of choice data generated through a random utility model, it is believed that there exists a latent (continuous) variable called utility that allows the respondent to define a preference ordering over alternatives in a choice set. The utility for each alternative in a choice set is assumed to be known by the respondent at the time of choice. Information about utility is revealed to the researcher as the respondent reacts to various choice sets (e.g., picks the best, picks the best and worst, rank orders, etc.). This process is illustrated in Figure 1 where a hypothetical respondent reacts to a set of 4 alternatives in a choice set. By selecting alternative B as the best, the respondent provides information to the researcher about the relative rank ordering of latent utility. Specifically, we know from this choice that the utility for alternative B is greater than the utility for alternatives A, C, and D. No information, however, is provided to the researcher about the relative ranking of utility for the non-selected alternatives. 
Figure 1 Implied Latent Utility Structure for a "Pick the Best" Choice Task (alternative B is selected as best; implied ordering: UD <> UC <> UA < UB)

In the case of a "Best-Worst" choice task the respondent provides information about the relative utility of two items in the choice set (i.e., the most and least preferred). In the example provided in Figure 2, the respondent identifies alternative B as the best and alternative C as the worst. As such, we know that alternative B is associated with the greatest level of utility and alternative C is associated with the lowest level of utility. No information is provided about the relative attractiveness of alternatives A and D. An argument in favor of MaxDiff analysis is that it economizes respondent time and effort by extracting more information from a given choice set than would be obtained by having the respondent simply pick the best. Presumably, it is easier to evaluate a single choice set and pick the best and worst alternatives than it would be to pick the best alternative from two separate choice sets.

Figure 2 Implied Latent Utility Structure for a "Best-Worst" Choice Task (B best, C worst; implied ordering: UC < UD <> UA < UB)

Figure 3 extends the analysis in Figure 2 by introducing a preference threshold. In this example the respondent is asked to pick the best and worst alternatives, thus informing us about their relative latent utility. In a follow-up question, the respondent is asked whether any of these items exceeds their preference threshold. This is an example of the indirect or dual-response approach to anchored MaxDiff. As illustrated in Figure 3, an answer of "no" informs the researcher that the latent utility for the outside good (i.e., the preference threshold) is greater than the utility for all of the alternatives within the choice set. This is extremely useful information, as it tells us that even though alternative B is most preferred, it is only the best option of an unattractive set of options. Investing in alternative B would not lead to a meaningful change in the respondent's behavior.

Figure 3 Implied Latent Utility Structure for a "Best-Worst" Choice Task with a Dual Response Anchor (B best, C worst, anchor answer "no"; the importance threshold exceeds the utility of every alternative shown)

APPROACHES TO ANCHORED MAXDIFF

As discussed above, three approaches have been proposed to provide a preference threshold or anchor for MaxDiff studies.

The Indirect Approach

The Indirect (or dual response) Approach to anchored MaxDiff involves the use of a series of follow-up questions, one for each choice task. An example of this style of choice task is presented in Figure 4. In each choice task, respondents are first asked to complete a standard MaxDiff exercise. Following selection of the best and worst options, they are asked to identify which of the following is true: (1) All of these features would enhance my experience, (2) None of these features would enhance my experience, or (3) Some of these features would enhance my experience. The subject's response to this question informs us about the relative location of the preference threshold in the latent utility space. Selection of option (1) tells us that the utility for the anchor is less than the utility for all presented features, whereas selection of option (2) tells us that the latent utility for the anchor is greater than the utility for the presented features. By choosing option (3), we know that the latent utility of the anchor falls somewhere between the best and worst features.
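For readers who prefer to see the bookkeeping spelled out, the minimal sketch below (our own illustration, in Python) enumerates the pairwise utility inequalities implied by a single best-worst response combined with a dual-response anchor answer, mirroring Figures 1 through 3.

```python
# Our own illustration: pairwise utility inequalities implied by one best-worst
# response plus an indirect (dual-response) anchor answer, as in Figures 1-3.

def implied_inequalities(shown, best, worst, anchor_answer):
    """Return (higher, lower) pairs meaning U[higher] > U[lower]."""
    pairs = {(best, j) for j in shown if j != best}       # best beats every other item
    pairs |= {(j, worst) for j in shown if j != worst}    # every other item beats worst
    if anchor_answer == "none":       # option (2): anchor above all items shown
        pairs |= {("anchor", j) for j in shown}
    elif anchor_answer == "all":      # option (1): anchor below all items shown
        pairs |= {(j, "anchor") for j in shown}
    else:                             # option (3): anchor between the best and worst items
        pairs |= {(best, "anchor"), ("anchor", worst)}
    return sorted(pairs, key=str)

print(implied_inequalities(["A", "B", "C", "D"], best="B", worst="C", anchor_answer="none"))
# ('anchor', 'A') means the importance threshold exceeds the latent utility of item A
```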
Figure 4 Example Choice Task with an Indirect (Dual Response) Anchor
"Thinking of your restaurant visit, which of these features, if improved, would most/least enhance your experience?" (Most Preferred / Least Preferred)
Have the restaurant be cleaner
Have the server stop by more often
Have the server stop by less often
Have more choices on the menu
"Considering just the 4 features above, which of the following best describes your opinion about enhancing your experience?"
All 4 of these features would enhance my experience
None of these features would enhance my experience
Some of these features would enhance my experience, some would not
Implied ordering: Anchor < UC < UD <> UB < UA

The Direct Approach

An illustration of the Direct Approach to anchored MaxDiff is presented in Figure 5. In the Direct Approach, respondents first complete a standard MaxDiff study. Upon conclusion, they are given a follow-up task wherein they are asked to evaluate the relative preference (relative to the defined preference threshold) of each of the features in the study. In the example below, respondents are asked to identify which of the features in the study would have a "very strong positive impact on their experience." If the respondent selects the first option from the list and none of the other options, we would know that the latent utility of that option is greater than the utility of the anchor, and that the utilities of the remaining features are less than the anchor. In general, an item that is selected in this task has utility in excess of the anchor, and items not selected have utility below the anchor.

Figure 5 Example Choice Task with the Direct Anchor
"Please tell us which of the features below would have a very strong positive impact on your experience at the restaurant. (Check all that apply)"
Be greeted more promptly
Have the restaurant be cleaner
Change the décor in the restaurant
Have the server stop by more often
Have the server stop by less often
Have the meal served more slowly
Have the meal served more quickly
Receive the check more quickly
Have the server wait longer to deliver the check
Have lower priced menu items
Have fewer kinds of food on the menu
Have more choices on the menu
None of these would have a strong positive impact on my experience
Implied ordering: … <> UD <> UC <> UB < Anchor < UA

The Status-Quo Approach

The Status-Quo approach is implemented by incorporating a preference anchor directly into the study, including it as an attribute in the experimental design. This is illustrated in Figure 6, where the anchor is specified as "No changes—leave everything as it was." If the option corresponding to the anchor is selected as the most preferred, we know that the latent utility for the anchor exceeds the latent utility for all other alternatives in the choice task. If the anchor is selected as the least preferred attribute, we know that its utility is less than the utility for the other features. Finally, if the anchor is not selected, we know that its utility falls somewhere in between the most and least preferred features.

Figure 6 Example Choice Task with a Status-Quo Anchor
"Thinking of your restaurant visit, which attribute if improved would most/least enhance your experience?"
(Most Preferred / Least Preferred)
Have the restaurant be cleaner
Have the server stop by more often
Have the server stop by less often
Have more choices on the menu
No changes—leave everything as it was
Implied ordering: UC < UD <> UB <> UA < Anchor

SIMULATION STUDY

We contrast the performance of each of the proposed anchored MaxDiff approaches using synthetic choice data (i.e., data where the true latent utility is known). We strongly prefer the use of simulated data for this exercise for a few reasons. First, with real respondents we would not know their true preference structure and would be left to make comparisons based on model fit rather than on the ability of each approach to recover the true underlying preferences. Second, it allows us to abstract away from framing differences in the execution of the anchored MaxDiff approaches described above. It would be exceptionally difficult to frame the questions in each of these approaches in such a way that they would be consistently interpreted by subjects. Simulation allows us to assume that respondents are both rational and fully informed when completing these exercises. It also lets us avoid psychological effects such as the ordering of the tasks and respondent fatigue across three full exercises.

Data for our study are simulated using the following procedure:
1. Simulate the (continuous) latent utility for each respondent and each alternative in a choice task according to the standard random utility model, U = Xß + ε, where ß is a pre-specified vector of preference parameters for a given respondent and ε is the random component of utility, drawn from a Gumbel distribution.
2. Each simulated respondent is then presented with a set of experimental choice exercises in which they provide best, worst, and anchor responses for each of the three proposed anchored MaxDiff approaches. It is important to note that, given a realization of utility within a choice set, we can use the same data to populate responses for all of the anchored MaxDiff approaches. In other words, the same data-generating mechanism underlies all three approaches; they differ only in how that information is manifest through the subject's response behavior (a minimal sketch of this step appears after this list).
3. Data generated from this simulation exercise are then modeled using an HB MNL model to recover the latent utility structure for each simulated respondent.
4. These estimated utilities are then compared with the simulated (true) latent utilities, thus allowing us to assess the performance of each of the proposed methods.

The procedure described above was repeated under a variety of settings where we modify the number of attributes, the location of preference (e.g., all good, all bad, or mixed), and respondent consistency (i.e., error scale). Please note that for our simulation we coded the Direct Method to be consistent with the Sawtooth Software procedure. That is, for each attribute an additional task is included in the design matrix pitting the attribute against the anchor, with the choice indicating whether the box was selected or not. An alternative specification would be to add just two additional tasks to the design matrix to identify which attributes are above and below the threshold (Lattery).
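The following is a minimal sketch (our notation, not the authors' code) of the data-generating step for one respondent and one task: a latent utility is drawn for each shown item, and the same realized utilities are then read out as a best pick, a worst pick, and an indirect-anchor answer.

```python
# Minimal sketch of the data-generating step (assumed notation, not the authors' code):
# latent utility = true part-worth + Gumbel error; the same realized utilities are
# then read out as a best pick, a worst pick, and an indirect-anchor answer.

import numpy as np

rng = np.random.default_rng(0)
n_items, items_per_task, threshold = 12, 4, 0.0   # anchor/threshold utility fixed at 0
beta = np.linspace(-1.5, 1.5, n_items)            # assumed "true" part-worths, one respondent
error_scale = 1.0                                 # larger scale = less consistent respondent

shown = rng.choice(n_items, size=items_per_task, replace=False)
u = beta[shown] + error_scale * rng.gumbel(size=items_per_task)

best_item = shown[np.argmax(u)]
worst_item = shown[np.argmin(u)]

# Indirect (dual-response) anchor answer derived from the same realized utilities.
if u.max() < threshold:
    anchor_answer = "none of these would enhance my experience"
elif u.min() > threshold:
    anchor_answer = "all of these would enhance my experience"
else:
    anchor_answer = "some would, some would not"

print(best_item, worst_item, anchor_answer)
```

Direct and Status-Quo responses can be populated from the same realized utilities (for example, by comparing each item's utility to the threshold, or by treating the threshold as an extra alternative in the task), which is the sense in which a single data-generating mechanism underlies all three approaches.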
SIMULATION STUDY RESULTS

We examine two objective measures of model performance in order to contrast the results of each variant of Anchored MaxDiff across the various permutations of the simulation. The first measure is threshold classification: how well the model correctly classifies each of the attributes with respect to the anchor/threshold, reported as the percent of attributes that are misclassified. The second measure is how well the approach is able to recover the true rank ordering of the attributes, reported as the percent of items that are incorrectly ranked. Model results are presented below for each of the experimentally manipulated settings.

Number of Attributes

In this first test we manipulate the number of attributes tested in a given design. We examine three conditions: Low, Medium, and High. The number of tasks was held constant for each condition, resulting in an increasing attribute-to-task ratio. Results of this test are provided in Figure 7. We observe that with either a medium or low number of attributes all three approaches perform equally well. However, in the High condition we observe that the Direct Approach seems to outperform both the Indirect and Status-Quo approaches, with the Status-Quo Approach performing slightly better than the Indirect Approach.

We believe that the superiority of the Direct Approach in the case of many attributes can be explained by examining the structure of the follow-up task. In the case of the Direct Approach, respondents evaluate each item with respect to the anchor following completion of the MaxDiff exercise. If there are 50 items being tested in the study, this requires completion of an additional 50 questions. In the case of the Status-Quo and Indirect approaches, information about the preference of items relative to the anchor is captured through the MaxDiff or follow-up question. For studies with many attributes and relatively few choice tasks, this may not provide much information about the relationship between a given attribute and the anchor. As such, the Direct Approach should be more informative about the relative rank ordering and preference of items.

Figure 7 Number of attributes: threshold classification (percent misclassified) and rank ordering (percent missed), by condition (Low, Medium, High) for the Indirect, Direct, and Status-Quo methods

Preference Location

In this second test, we manipulate the location of the preference parameters relative to the anchor. In condition 1, all simulated parameters are lower than the preference threshold (i.e., nothing should be changed). In condition 2, we set all simulated parameters above the preference threshold. In the third and final condition, we allow the parameters to be mixed with respect to the threshold (i.e., some items are preferred over the threshold and some are not).

Results for this second test are presented in Figure 8. In terms of recovery of threshold classification, as expected, all three approaches perform exceptionally well when the preference parameters are all either above or below the threshold. When valuations are mixed with respect to the anchor, the Indirect Method does a slightly better job at recovering parameters. The Indirect Method can be the most informative approach, as it requires a follow-up question in each choice task, but only if the middle anchor point is not used too much. When valuations are all above or below the threshold, it appears as though the Status-Quo method was less able to recover the true rank ordering of items.
This can be explained by observing that under this condition the Status-Quo method should be much less informative about the rank ordering of items. Because the Status-Quo option appears as an alternative in each choice task, it will always be selected when valuations for the remaining items are either all good or all bad (relative to the anchor). As such, we collect less information about the latent utility of the tested items than we would with mixed valuations. An alternative design with the Status-Quo option appearing in only a subset of the scenarios (Sawtooth Software MaxDiff documentation) would likely improve the rank-order recovery for this method. This finding implies that if we expect all of the items to be either all good or all bad (relative to the anchor), we should either not use the Status-Quo approach for Anchored MaxDiff or have the threshold level appear in only a subset of the tasks. That said, if we have strong knowledge of the relative utility of the items with respect to the anchor, we do not need to use Anchored MaxDiff in the first place.

Figure 8 Preference location: threshold classification (percent misclassified) and rank ordering (percent missed), by condition (All Bad, All Good, Mixed) for the Indirect, Direct, and Status-Quo methods

Respondent Consistency

Respondent consistency is manipulated by increasing or decreasing the average attribute utility relative to the size of the logit error scale. In the high-consistency condition, the average size of the preference parameters is large, thus allowing respondents to be very deterministic in their choices. In the low-consistency condition, the average size of the simulated parameters is very low relative to the error scale, thus increasing the perceived randomness in subject responses. For completeness we include a medium-consistency condition.

Results for this exercise are presented in Figure 9. When respondents are consistent in their choices (i.e., the high-consistency condition) all three approaches perform well in both relative and absolute terms. However, in the medium- and low-consistency conditions the Direct Method performs much worse than either the Indirect or Status-Quo approaches. It is our belief that this is a direct result of the design of the Direct Method. In the Direct Method, all information about the relative preference of an attribute with respect to the anchor is captured through a single direct binary question presented at the end of the survey. If respondents are inconsistent or error prone (as is the case in the medium and low conditions), a mistake in this last set of questions will have serious implications for our ability to recover the true threshold classification or rank ordering. In the Indirect and Status-Quo approaches, each item has the potential to be classified relative to the anchor multiple times. As such, any single error on the part of a respondent will not have as large an impact on parameter estimation.

This result implies that all three approaches are likely to work well if respondents are consistent in their choices. However, if respondents are inconsistent in their choices, the Direct Method is likely to perform much worse than either the Indirect or Status-Quo approaches. This should be an important consideration in the selection of an Anchored MaxDiff approach, as it is difficult to forecast respondent consistency prior to the execution of a study.
Figure 9 Respondent consistency: threshold classification (percent misclassified) and rank ordering (percent missed), by condition (High, Medium, Low) for the Indirect, Direct, and Status-Quo methods

CONCLUSION

Taken collectively, our analysis implies the following:

Under "regular" circumstances all three anchoring techniques perform well. If the anchor location is not at an extreme, respondent consistency is high, and the number of attributes being tested is low, all three techniques are virtually indistinguishable in the simulation results.

All else being equal, the Status-Quo anchoring technique is the easiest to understand and implement. The Status-Quo method is also easier for an analyst to implement, as it does not require additional manipulation of the design matrix prior to estimation.

The Direct Method should be avoided when respondent consistency is expected to be low.

As the number of attributes increases, the Direct Method outperforms the other approaches in simulation. However, implementation of this method with many attributes will substantially increase respondent burden.

LIMITATIONS AND AREAS FOR FUTURE RESEARCH

Error rate of the anchor judgments

We assumed zero error in the anchor judgments for all of the simulations. Using different rules for each approach to set the anchor error rate is not a good option, as the assumptions used would likely determine the results. For each test we systematically controlled the error for the choice exercises to see if the approaches differed in accuracy with varying amounts of error. However, since the anchor judgments are completely different tasks, there is not a way to systematically vary the same amount of error across the three approaches. A novel approach to fairly assigning error across the anchor tasks would benefit future research.

Appearance rate of the Status-Quo option

In our simulation of the Status-Quo method, the anchor was included in each of the choice tasks. However, the Status-Quo method is often performed by incorporating the anchor as one of many attributes in the experimental design, which leads to its inclusion in a smaller subset of the tasks. It will be interesting to see how well the Status-Quo method does as the frequency with which the anchor appears varies. Having the anchor appear less often might alleviate some of the error introduced when most of the items are above or below the anchor. Future research can easily be adapted to test this hypothesis.

Jake Lee
Jeffrey P. Dotson

BEST-WORST CBC CONJOINT APPLIED TO SCHOOL CHOICE: SEPARATING ASPIRATION FROM AVERSION

ANGELYN FAIRCHILD
RTI INTERNATIONAL

NAMIKA SAGARA
JOEL HUBER
DUKE UNIVERSITY

WHY USE BW CBC CONJOINT TO STUDY SCHOOL CHOICE?

Increasingly, school districts are allowing parents to have a role in selecting which school they would like their children to attend. These parents make high-impact decisions based on a number of complex criteria. In this pilot study, we apply best-worst (BW) CBC conjoint methods to elicit tradeoff preferences for features of schools. Given sets of four schools defined by eight attributes, parents indicate the school they would like most and least for their child. We prepare respondents for these conjoint choices using introductory texts, reflective questions, and practice questions that gradually increase in complexity.
Despite a relatively small sample size (n=147), we find that Hierarchical Bayes (HB) preference weight estimates are relatively stable when best-worst choices are estimated either separately or jointly. Further, we find that the "best" responses reflect a relatively balanced focus on all attribute levels, while "worst" responses focus more on negative levels of attributes. Thus, the "best" responses reflect parents' aspirations while "worst" responses reflect their aversions.

Historically, most students in the United States were assigned to a school based on their home address. Choices about schooling were typically a byproduct of housing location, in which the perceived quality of the school was a major factor in a family's housing choice. Recently, in a number of areas, including Boston and Charlotte, school districts have allowed parents a choice in determining which school their children will attend. The structure of these choice contexts varies greatly—some districts are expanding options for competitive magnet or charter schools, while others allow all parents to select their preferred school from a menu of options. These decisions are complex, high impact, and low frequency, making them, as we will propose, ideal for best-worst analysis. Best-worst analysis can be used to complement the analysis of actual school choice. While there have been studies of the outcomes of actual school choice programs (Hastings and Weinstein 2008; Phillips, Hausman and Larsen 2012), this is, to our knowledge, the first best-worst CBC analysis of school choice. Other similarly complex and high-impact decisions, including decisions over surgical options (Johnson, Hauber et al. 2010), cancer treatments (Bridges, Mohamed et al. 2012), and long-term financial planning (Shu, Zeithammer and Payne, 2013), have been successfully studied using conjoint methods. For decisions like these, observational data may be weak or unavailable, or only available retrospectively, making preferences difficult to ascertain. When program planning and efficacy depend on anticipating choices, choice-based conjoint is especially helpful because it provides an opportunity to understand choice preferences prospectively in a controlled experimental environment.

We developed a best-worst CBC conjoint survey following a format developed by Ely Dahan (2012). In our task, parents are given eight sets of four school options and asked, for each, to indicate the school they would like best and the one they would like least for their child. There are two advantages of the best-worst (BW) task for our purposes. First, the task generates 16 choices from a display of eight choice sets, and thus has the potential to assess preferences more efficiently. Second, and more important, it is possible to make separate utility assessments of the Best and the Worst choices. In the context of school choice, it is possible that both best and worst choices reflect very similar utilities, differing only by scale, but we will show that in fact they substantively differ. We will also demonstrate the predictive value of the conjoint task by positioning the attributes in a two-dimensional space and projecting characteristics of respondents into that space. While a two-dimensional space does not capture the rich heterogeneity in the relationships between respondent characteristics and choice preferences, it nevertheless provides insight into important differences across respondents in their school choices.
Finally, we will show that measures of respondent values can be generated with different analytical models and that these measures are fairly consistent between models. In particular, we pit results from a Hierarchical Bayes (HB) estimate against a simple linear probability model in which preferences for the attribute levels are assumed to be linearly related. The consistency between the two models suggests that the simpler linear probability estimates could be used on the fly to provide respondents with feedback on their personal importance weight estimates. Below, we describe our survey development and administration in more detail, followed by a discussion of the methods used to analyze the choice data. We then discuss the results of our analysis and characterize contexts in which BW conjoint makes sense. SURVEY DESIGN AND IMPLEMENTATION The goal of the survey was to assess whether conjoint methods could be used to simulate real-world choices for schooling options, and to provide meaningful importance weight estimates based on stated preferences. In school choice programs, parents evaluate schools based on a broad range of criteria. Attributes included in the study were based on the school characteristics that were publicly available through official “school report cards,” online school quality ratings, and school websites. After identifying a list of salient school features, we identified plausible ranges reflecting actual variation in each attribute. Attributes were not included in the final choice questions if the actual variance was small. Examples of variables dropped for that reason include the number of days in school, length of the school day, and teacher credentials. We pretested the survey with a convenience sample of 15 parents in order to assess survey comprehension and to further narrow a long list of attributes down to the final eight. The choice scenarios asked respondents, “Suppose that you have just moved to a new area where families are able to choose which school they would most like their children to attend.” Parents were instructed to select the school they liked most and the school they liked least from sets of four in each of the eight choice questions. To acclimate them to the choice scenario and question format, respondents completed a series of practice questions that increased in difficulty as new attributes were introduced. The simplest practice choice included only two continuous attributes and two schools, teaching respondents how to indicate their most and least liked school. The next practice choice included four continuous attributes and four schools, and 318 featured one clearly superior profile and one clearly inferior profile as a test for respondent comprehension. The full BW conjoint questions included four continuous and four binary attributes similar to that shown in Figure 1. Figure 1: Sample B-W Choice Task Respondents were introduced to the survey attributes one at a time. The attributes and levels included in the final survey are shown in Table 1 and Table 2. For each attribute, we defined the attribute and listed the possible range of levels. In addition, we asked a series of questions about the respondent’s experience with that feature at their child’s current school. This reinforcement encourages respondents to think about each feature and relate the new information to their own experience. Symbols representing each binary feature were introduced as part of the attribute descriptions. 
Respondents were also quizzed to test and reinforce their comprehension of the symbols, and in our sample 90% of respondents got all of these questions right.

Table 1: Continuous attributes (levels and the description shown to respondents)

Travel Time
Levels: 5 minutes, 15 minutes, 30 minutes, 45 minutes
Description: One way that the schools will differ for you is how long it will take for your child to get to school by public school bus. All schools have bus transport from your home, but the bus ride can take from 5 to 45 minutes. Of course your child does not have to take the bus, but for these choices you can assume that driving will take approximately the same time.

Academic: Percent under grade level
Levels: 15%, 25%, 35%, 45%
Description: The schools also differ in terms of academic quality. Every year, students in public schools take tests that measure if their skills are sufficient for their grade level. If students' scores are not high enough on these tests they are considered below grade level. Every school has some students who are below grade level. Among the schools in your new area, schools could have as few as 15% (15 out of 100) and as many as 45% (45 out of 100) of students below grade level.

Economic: Percent economically disadvantaged
Levels: 10%, 30%, 50%, 70%
Description: The schools in your new area also differ in the percent of students that are economically disadvantaged. Often children from low income families can get help paying for things like school lunches, after school programs, and school fees. The percent of students at a school who are economically disadvantaged is different in every school and could range from 10% to 70% of students (10 to 70 out of every 100 students).

Percent minority
Levels: 25%, 40%, 55%, 70%
Description: The schools in your new area also differ in racial diversity. Students come from many racial or ethnic backgrounds. The percent of students who are defined as minorities (African American, Hispanic/Latino, American Indian or people from other racial or ethnic backgrounds besides Caucasian or White) ranges from 25% to 70% (25 to 70 out of 100 students).

Table 2: Binary features (each feature was represented in the survey by a symbol)

Promote sports teams
Description: All of the schools in your new area have physical education classes, where children play sports and exercise with the rest of their class. Some schools encourage students to join sports teams. When students join sports teams, they practice every day after school with other students their age, and play games against other schools. In your new area, students at schools with sports teams can choose to play various sports, such as basketball, volleyball, wrestling, baseball, track and field, football, or soccer. School sports teams do not cost any money, but students on sports teams may not be able to ride the bus home and often need a ride home from practice and games in the evening.

International Baccalaureate (IB) program
Description: Some schools offer the International Baccalaureate (IB) program, which connects students all around the world with a shared global curriculum. The IB program emphasizes intellectual challenge, encouraging students to make connections between their studies in traditional subjects and to the real world. It fosters the development of skills for communication, intercultural understanding and global engagement, qualities that are essential for life in the 21st century.

Science, Technology, Engineering, and Math (S.T.E.M.)
Description: Some schools offer special classes in science, technology, engineering, and math (S.T.E.M.).
Students in these classes get extra practice with science and math, working on computers, and doing projects that use science and math skills. These classes help students to be more prepared for advanced classes, college, and jobs.

Expanded Arts Program
Description: Some schools have an expanded arts program that allows children to practice their artistic abilities. When schools have expanded arts programs, students can choose to take classes like theater, dance, choir, band, painting, drawing, or ceramics. Students who take these classes may participate in performances and in competitions against other schools.

In addition to the attribute definitions, reflective questions, practice choice questions, and actual choice questions, the survey also included a series of background questions. These included questions on the respondents' demographic background, their school-age children, and the degree of their involvement with their child's education. We also included self-explicated importance questions, in which respondents rated the importance of each of the eight attributes included in the choice questions. For all responses, we used an efficient fixed choice design with level balance and good orthogonal properties that was built using SAS modules (Kuhfeld and Wurst 2012).

ANALYSIS

The survey was administered online to a U.S. nationwide sample of 147 SSI panelists. Respondents were prescreened to include only parents who expect to send a child to a public middle or high school (grades 6 through 12). The sample was 57% female and 59% white, with 59% having had some college and 50% having an income of under $50,000 per year.

In order to differentiate between choice preferences for the most vs. least liked school, we conducted independent HB analyses (via Sawtooth Software) of the eight Best and the eight Worst responses. These results, shown in Figure 2, demonstrate substantial differences between these two tasks. In particular, we see that the least liked school puts more emphasis on avoiding the worst levels of each of the 4-level attributes. Put differently, the worst judgments do not differentiate as much between the best and second-best levels of the attributes but strongly differentiate between the worst and the second-worst levels. Such a pattern is consistent with
Normalized importance scores were calculated for each respondent as the difference between the most and least preferred levels within an attribute, represented as a percentage of the sum of these differences across all eight attributes. Figure 3 provides a two-dimensional solution to the principal components analysis in which positively correlated attributes are grouped together. The normalized importances from the Best choices are represented as squares while the Worst are 322 represented as circles. These points reflect the factor loadings of the two-dimensional factor solution. Figure 3 Principal components representation in two dimensions of eight attribute importance measures from the Best (squares) and the Worst (circles) judgments The vertical dimension is anchored at the top by academic quality as defined by the percent of students below grade level. Notice that both the Best and Worst measures load about equally on that dimension. The attributes at the bottom of the vertical dimension are Sports and Arts, reflecting their generally negative correlations (around -.4) with academic quality. The map thus suggests that those who value sports are less likely to care about the percent of people in the class that test below grade level, and those who place a high value on student test scores place less importance on sports and arts. The degree of fit for the loadings in this space is linearly related to their distance from the origin. Using that criterion, the results from the Best and Worst choices both span the entire space, although Best fits slightly better than Worst. Figure 4 projects vectors representing respondent characteristics onto the orthogonal factor scores to reflect the correlations between respondent demographics and preference patterns. 323 These respondent characteristic vectors show that better educated parents who are employed full time are more likely to value schools with high academic quality, while those with part time employment and less education are more likely to prefer schools with strong sports or arts programs. Figure 4 Projections of respondent characteristic vectors into the two-dimension principal components space The horizontal dimension contrasts those who care about what is taught in the school with those who are concerned with who is in the school. To the right are those for whom the content areas of IB and STEM are important, populated largely by minority parents and those with older children. To the left are parents concerned with the number of students from economically disadvantaged family backgrounds and with the percent minority representation at the school. This latter group of parents tend be white, have younger children and more income. There are two simplifications in the choice map given in Figures 3 and 4. First, we are summarizing eight attributes into a two-dimensional space. While the space accounts for 34% of 324 the variation, much variation is still left out. For example, this map could be expanded to include a three dimensional space that focuses on bus travel time, yielding additional insights into the interplay between preferences and demographic characteristics. The second simplification is that the map is generated from importance scores, which replace the information on the 4-level attributes with one importance measure reflecting the range of those part-worths. Little information is lost for monotone attributes like percent of students under grade level and bus travel time. 
Figure 3 Principal components representation in two dimensions of eight attribute importance measures from the Best (squares) and the Worst (circles) judgments

The vertical dimension is anchored at the top by academic quality, as defined by the percent of students below grade level. Notice that both the Best and Worst measures load about equally on that dimension. The attributes at the bottom of the vertical dimension are Sports and Arts, reflecting their generally negative correlations (around -.4) with academic quality. The map thus suggests that those who value sports are less likely to care about the percent of students in the class who test below grade level, and those who place a high value on student test scores place less importance on sports and arts. The degree of fit for the loadings in this space is linearly related to their distance from the origin. Using that criterion, the results from the Best and Worst choices both span the entire space, although Best fits slightly better than Worst.

Figure 4 projects vectors representing respondent characteristics onto the orthogonal factor scores to reflect the correlations between respondent demographics and preference patterns. These respondent characteristic vectors show that better educated parents who are employed full time are more likely to value schools with high academic quality, while those with part-time employment and less education are more likely to prefer schools with strong sports or arts programs.

Figure 4 Projections of respondent characteristic vectors into the two-dimensional principal components space

The horizontal dimension contrasts those who care about what is taught in the school with those who are concerned with who is in the school. To the right are those for whom the content areas of IB and STEM are important, populated largely by minority parents and those with older children. To the left are parents concerned with the number of students from economically disadvantaged family backgrounds and with the percent minority representation at the school. This latter group of parents tends to be white, have younger children, and have more income.

There are two simplifications in the choice map given in Figures 3 and 4. First, we are summarizing eight attributes into a two-dimensional space. While the space accounts for 34% of the variation, much variation is still left out. For example, this map could be expanded into a three-dimensional space that focuses on bus travel time, yielding additional insights into the interplay between preferences and demographic characteristics. The second simplification is that the map is generated from importance scores, which replace the information on the 4-level attributes with one importance measure reflecting the range of those part-worths. Little information is lost for monotone attributes like percent of students under grade level and bus travel time. However, for more complex attributes like percent minority or percent economically disadvantaged, around 25% of respondents found an interior level to be superior to both extreme levels. For these respondents, the correlation map shown in Figure 3 fails to capture the implied "optimal" point for these attributes.

Finally, we note that the analysis in Figure 3 is generated from the separate Best and Worst choices. A very similar result emerges with the combined values as well, except that the combined values tend to have slightly higher loadings and marginally greater correlations with parent characteristics.

LINEAR PROBABILITY INDIVIDUAL CHOICE MODEL

In Dahan's (2012) study, individual choice models are estimated for each person in real time, rather than batch processing the entire set of results using Hierarchical Bayes. We use a similar simple linear probability model to produce a different estimate of individual-level preferences. The linear model entails three heroic assumptions. First, the probability of choice is treated as a linear dependent variable even though it should be represented with a logit or probit function. Second, the part-worth functions for the four-level attributes are linearized, thus ignoring any curvature. Finally, the coefficients of the attributes for the "worst" choices are assumed to be the negative of the coefficients for the "best" choices. These assumptions are sufficiently troubling that one would not build such a model except, as Dahan suggests, where there is a need to give respondents real-time feedback on the linear importance of the choices made. We will show that, in spite of its questionable assumptions, the linear probability choice model shows a surprising correspondence to the appropriate Hierarchical Bayes results.

The process of building the linear probability model is simple and follows from the general linear model for our study with 8 attributes and 8 Best-Worst choices from sets of four profiles:
Define the choice vector Y with 32 items, 4 for each choice set, coding Best as 1, Worst as -1, and zero otherwise.
Zero-center the design matrix, X (32 x 8), within each choice set.
Estimate ß = (X'X)-1X'Y, by multiplying (X'X)-1X' (8 x 32) by Y (32 x 1) to get ß (8 x 1).

The resulting vector ß reflects the linear impact of a unit change in each of the attributes on the probability that a person will choose the item. This calculation is sufficiently simple that it could be programmed into a computer-based survey and multiplied by the respondent's choice vector Y to produce linear importance estimates. This is particularly true when, as in our case, the design is fixed, so (X'X)-1X' need be computed only once and can be done in advance. With a random design, it might be necessary to do this inversion and multiplication on the fly for each respondent. Our pilot survey did not implement this immediate calculation step; however, it would be possible to do so.
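A small numerical sketch of the calculation just described is given below (matrix and variable names are ours, and the design and responses are placeholders), using the stated coding of Y and a design zero-centered within each choice set.

```python
# A numerical sketch of the linear probability calculation described above:
# 8 attributes, 8 best-worst sets of 4 profiles, Y coded +1 (best), -1 (worst), 0 otherwise.

import numpy as np

rng = np.random.default_rng(2)
n_sets, n_alts, n_attr = 8, 4, 8

# Design matrix: one row per profile, linearized attribute codes, zero-centered within set.
X = rng.normal(size=(n_sets * n_alts, n_attr))               # placeholder design
set_means = X.reshape(n_sets, n_alts, n_attr).mean(axis=1)
X = X - np.repeat(set_means, n_alts, axis=0)

# Choice vector for one respondent: +1 for the profile picked as best, -1 for worst.
Y = np.zeros(n_sets * n_alts)
for s in range(n_sets):
    best, worst = rng.choice(n_alts, size=2, replace=False)  # placeholder responses
    Y[s * n_alts + best], Y[s * n_alts + worst] = 1.0, -1.0

# beta = (X'X)^-1 X'Y; with a fixed design the projector (X'X)^-1 X' is computed once.
projector = np.linalg.solve(X.T @ X, X.T)   # 8 x 32
beta = projector @ Y                        # 8 linear importance estimates
print(beta.round(3))
```

Because the projector depends only on the fixed design, it can be precomputed once and applied to each respondent's Y with a single matrix-vector multiplication, which is what makes real-time feedback feasible.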
Given the questionable assumptions required to support this simplified model, it is natural to wonder how well the linear probability results compare with the HB results. Figure 5 shows the correlation between the combined best-worst HB estimates and the linear probability model. For most attributes, including academic quality, percent minority, travel time, sports, STEM, and arts, the correlation coefficients were relatively high, ranging between .70 and .90. Linear and HB estimates were less correlated for IB and for the economic (percent economically disadvantaged) attribute.

Figure 5 Correlation of Importances between HB and Linear Choice Models

We also compared the HB and linear estimates of choices with a direct self-explicated measure of importance. Our survey included self-explicated importance questions asking how important each attribute was on a 7-point scale. We zero-centered these self-explicated importance ratings within each respondent and computed correlations with the linear and HB importance scores. The correlations are relatively poor for both models, which may reflect the difficulty people generally have with such direct assessments of importance. However, for four out of eight attributes HB is more correlated with the self-explicated data than is the linear model; for two attributes the linear model is more correlated, and for two attributes the HB and linear models are about equally correlated with the self-explicated importances.

There are a number of contexts where the ability to give feedback on the fly would be useful. Consider a BW conjoint study that asks cardiac patients about their tradeoffs between surgery options that vary on the extent of the operation, likelihood of success, risk of complications, recovery time, and out-of-pocket costs. Patients could then be given an immediate summary of what is important to them from the linear probability model and asked if they would like to change any of its values. The patients would then have the option of sending those adjusted values to the surgeon, who would meet with them to help make their surgery decision.

Figure 6 Correlation of HB (dark) and Linear (light) Importance Estimates with Direct Self-explicated Estimates of Importance

CONCLUSIONS

There are a number of surprising conclusions suggested by this pilot study of school choice. These relate to the contexts in which best-worst conjoint is appropriate, the kind of analyses that should be used with these data, and the applicability of the results in a real-world context. Each of these is discussed below.

BW conjoint is most appropriate for choices where respondents may have both strong desires for, and aversions to, specific attribute levels. Aversion is typically not an issue for choices among package goods such as breakfast cereal, or short-lived experiences such as weekend vacations. In those cases where there are many options and few long-term risks, people quickly focus on trading off what they want rather than focusing on what they do not want. For example, one would not learn much about a person's cereal purchase by knowing that their least favorite is Coco-Puffs. In such cases, one might consider including first and second choices, but evidence indicates that it may not be worth the extra respondent time required (Johnson and Orme 1996).

However, choices with desired positive and unavoidable negative features are ideal for best-worst CBC. These include medical choices which combine positive and negative features, housing choices in which any decision has clear advantages and disadvantages, and, in our case, school choices. Our results clearly show that the part-worth patterns for the Worst option are quite different from the Best. Asking respondents to think about their least preferred alternative reveals what they most want to avoid, and such avoidance behavior may override the otherwise positive features of an alternative. Thus, the two perspectives provide a chance to better understand the mechanisms of choice.
We tested and compared several methods for estimating the importance of the attributes to individual respondents. While the HB models produced a better fit and required fewer questionable assumptions, the linear model generated a surprisingly high correlation with the HB model. While not perfect, a linear model may be useful where there is a desire to give immediate feedback to the respondent. Such feedback could be used as a decision aid for people in the process of making a high-impact, low-frequency decision. Our analysis demonstrates not only that conjoint methods can be used to elicit preferences for school choices, but also that best-worst choices provide additional salient insight into what parents are willing to trade off in choosing schools. Similar conjoint studies could be used in school district planning or designing school choice policies. For example, the results from this pilot survey suggest that schools that have difficulty attracting strong academic students might increase enrollment among some demographic groups by developing strong arts or sports programs. Alternatively, school district planners might use a school choice simulation within their target community to generate school choice sets that maximize the probability that children attend a desired school and minimize the likelihood of being assigned to an undesired school. In sum, best-worst CBC is an important method of dealing with high impact, long-term decisions.

Joel Huber

REFERENCES

Bridges, John F., A. F. Mohamed, et al. (2012). "Patients' preferences for treatment outcomes for advanced non-small cell lung cancer: A conjoint analysis." Lung Cancer 77(1): 224–231.
Dahan, Ely (2012). "Adaptive Best-Worst (ABC) Conjoint Analysis," 2012 Sawtooth Software Conference Proceedings, 223–236.
Hastings, Justine S., and Jeffrey M. Weinstein (2008). "Information, school choice, and academic achievement: Evidence from two experiments." The Quarterly Journal of Economics 123(4), 1373–1414.
Johnson, Richard M. and Bryan K. Orme (1996). "How Many Questions Should You Ask in Choice-Based Conjoint Studies," www.sawtoothsoftware.com/download/techpap/howmanyq.pdf
Johnson, F. Reed, B. Hauber, et al. (2010). "Are gastroenterologists less tolerant of treatment risks than patients? Benefit-risk preferences in Crohn's disease management." Journal of Managed Care Pharmacy 16(8): 616–628.
Kuhfeld, Warren F. and John C. Wurst (2012). "An Overview of the Design of Stated Choice Experiments," 2012 Sawtooth Software Conference Proceedings, 165–194.
Phillips, Kristie J. R., Charles Hausman, and Elisabeth S. Larsen (2012). "Students Who Choose And The Schools They Leave: Examining Participation in Intradistrict Transfers." The Sociological Quarterly 53(2), 264–294.
Shu, Suzanne, Robert Zeithammer and John Payne (working paper, UCLA). "Consumer Preferences of Annuities: Beyond NPV."

DOES THE ANALYSIS OF MAXDIFF DATA REQUIRE SEPARATE SCALING FACTORS?

JACK HORNE
BOB RAYNER
MARKET STRATEGIES INTERNATIONAL

Scale of the error terms around MaxDiff utilities varies between best and worst responses. Most estimation procedures, however, assume that scale is fixed, leading to potential bias in the estimated utilities. We investigate to what degree scale actually does vary between response categories, and whether true utilities may be better recovered by properly specifying scale when estimating combined best-worst utilities.
BEST-WORST SCALING: TWO RESPONSE TYPES, ONE SET OF UTILITIES

Maximum-difference (MaxDiff) or Best-Worst scaling has been a widely used technique among market researchers since it was first introduced more than 20 years ago by Jordan Louviere (Louviere, 1991; Cohen and Orme, 2004). The technique involves repeated choices of best and worst items in tasks where only small subsets (usually 4 at a time) of the total number of items are presented. A single vector of utilities is estimated from the choice data, equal in length to J - 1 where J is the total number of items, using an MNL model. To account for some items being selected as best and others as worst, response data are coded as 1 for best choices and as -1 for worst choices. The selection of an item as worst, in other words, leads to a more negative utility for that item, while the selection of an item as best leads to a more positive utility.

Figure 1. Best and worst utilities, estimated separately. Fit lines are y=x and least squares.

This analytic framework assumes that the best and worst choices used to generate the single set of utilities follow the same distribution. In point of fact though, we often see that best and worst utilities (estimated separately) are distributed differently. Figure 1 shows this idea for a data set with 17 attributes. The range of the worst utilities is wider than the range of the best utilities by a factor of 1.3. This suggests that there is less error around the worst utilities in this data set, leading to a greater deviation from 0 in the estimated utilities for those responses. The above-described analytic framework does not take into account this distributional difference between best and worst responses. Dyachenko et al. (2013a, 2013b) have found similar patterns in other data sets. They suggest that best and worst responses may follow different distributions as a result of a sequential effect (respondents are more accurate in later choices) and an elicitation effect (respondents are more certain about what they like least). Regardless of the cause, what effect do these psychological processes have on estimation of a single set of best-worst utilities?

MATHEMATICS OF CHOICE PROBABILITIES AND ERROR DISTRIBUTIONS

Utilities are typically estimated from best-worst choice data using a logit model. That model is described as Unj = Vnj + εnj, where Vnj is the observable part of the utility, often Vnj = xnj'β, and εnj is the unobservable error surrounding it (Train, 2007). Errors (εnj) are Gumbel distributed, and have a mode parameter and a positive scale parameter, μ. As μ increases, the magnitude of estimated utilities decreases, all else being equal. Delving further, choice probabilities from the logit model were defined by McFadden (1974) as follows, where Pni is the probability that respondent n will choose item i:

Pni = exp(Vni/μ) / Σj exp(Vnj/μ)

The typical analytic framework used in estimating MaxDiff utilities assumes that scale is fixed, and μ drops out of the above equation. However, as we and others have seen, μ is not always fixed across best and worst responses (the ranges of these utilities, estimated separately, can vary considerably). In this paper, we investigate, using simulations and actual MaxDiff data, whether failure to account for these different scales affects the estimated utilities.

SIMULATIONS

Seventeen utilities were generated, ranging from -8/3 to +8/3 (equally spaced) and common across 10,000 "respondents." Error around best responses is distributed as Gumbel Type I and error around worst responses is distributed as Gumbel Type II (Orme, 2013).
The Gumbel Type I CDF is F(ε) = exp(-exp(-ε/μ)). Therefore, Gumbel Type I error was added to the above utilities to form distributions of "best" utilities using the following construct: ε = -μ ln(-ln(u)), where u is a random uniform variate ranging from 0 to 1. Scale (μ) was uniformly set at 1 to form best utilities. Gumbel Type II error, the corresponding error for minima, is defined as the negative of Gumbel Type I error. Given this relationship, error was subtracted from generated utilities to form distributions of worst utilities, using the same construct used in forming best utilities. Scale (μ) was varied in forming worst utilities. All of these utilities, two 10,000 "respondents" by 17 items matrices, were then converted to best-worst choice data using a design consisting of 10 tasks per "individual" and 4 items per task, yielding 100,000 best responses and 100,000 worst responses. There were 8 versions in the design; "respondents" were randomly assigned to version. Finally, new best-worst combined utilities were estimated from the choice data by maximizing the log likelihood (LL) in an MNL model, varying the assumed scale parameter for worst responses:

LL = Σn Σi yni(best) ln Pni(best) + Σn Σi yni(worst) ln Pni(worst),

where yni = 1 if "respondent" n chooses item i, and yni = 0 otherwise. All estimation was at the individual level (HB) using custom R code developed by one of the authors to account for scale ratios (package: "HBLogitR," forthcoming on CRAN). Results of several simulations are shown in Table 1. Utilities are best recovered when the actual scale ratio among best and worst utilities is the same as that used in estimation (assumed). When the assumed scale ratio (best/worst) is larger than the actual, estimated utilities are biased, especially those nearest the lower end of the utility range.

Table 1. Estimated utilities from simulations and absolute deviances from actual. Column headers refer to actual best/worst ratio and assumed best/worst ratio. Mean absolute deviances are: 1/1 = 0.142; 2/2 = 0.075; 1/2 = 0.588.

Actual    1/1    dev.    2/2    dev.    1/2    dev.
-2.67    -2.99   0.32   -2.86   0.19   -4.42   1.75
-2.33    -2.51   0.18   -2.44   0.10   -3.58   1.25
-2.00    -2.25   0.25   -2.16   0.16   -3.04   1.04
-1.67    -1.82   0.16   -1.74   0.07   -2.30   0.63
-1.33    -1.50   0.16   -1.41   0.07   -1.65   0.32
-1.00    -1.05   0.05   -0.97   0.03   -0.95   0.05
-0.67    -0.70   0.03   -0.67   0.01   -0.46   0.21
-0.33    -0.33   0.00   -0.31   0.02   -0.01   0.32
 0.00    -0.06   0.06   -0.03   0.03    0.21   0.21
 0.33     0.35   0.01    0.35   0.02    0.77   0.44
 0.67     0.82   0.15    0.73   0.07    1.11   0.44
 1.00     1.13   0.13    1.09   0.04    1.53   0.53
 1.33     1.47   0.13    1.37   0.04    1.87   0.54
 1.67     1.81   0.14    1.77   0.11    2.23   0.56
 2.00     2.23   0.23    2.11   0.12    2.58   0.58
 2.33     2.56   0.23    2.46   0.12    2.91   0.58
 2.67     2.84   0.18    2.74   0.08    3.21   0.54

If the assumed best/worst scale ratio is smaller than the actual, a similar bias occurs (not shown) where there is greater deviance at the upper end of the utility range. A bias is clearly present when the scale ratio assumed in analysis does not match the scale ratio among the true utilities.

ACTUAL DATA

Two data sets were used to test whether accounting for scale differences between best and worst responses removes any bias from estimated utilities. Data set 1 consisted of 17 items and 300 respondents. Each respondent evaluated 10 best-worst exercises (quads). Data set 2 consisted of 20 items and 918 respondents. Each respondent in this data set evaluated 15 best-worst exercises (quads). The scale parameter around utilities estimated from a logit model is not identified (Ben-Akiva and Lerman, 1985; Swait and Louviere, 1993; Train, 2007). However, given two (or more) response types (e.g., best and worst) a scale ratio is estimable. There are several ways to estimate this scale ratio.
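As a rough illustration of the simulation construct described above, the R sketch below adds Gumbel Type I error (scale set at 1) to the 17 true utilities to form "best" utilities and subtracts a Gumbel draw with a different scale to form "worst" utilities, then converts one four-item task to best and worst picks. The worst-side scale of 0.5 and the task composition are illustrative choices, not values taken from the paper.

set.seed(42)
n_resp   <- 10000
true_u   <- seq(-8/3, 8/3, length.out = 17)  # 17 equally spaced utilities
mu_best  <- 1.0                              # scale for best responses
mu_worst <- 0.5                              # assumed (smaller) scale for worst responses

# Inverse-CDF draw from the Gumbel Type I distribution with scale mu
gumbel <- function(n, mu) -mu * log(-log(runif(n)))

# 10,000 x 17 matrices of noisy "best" and "worst" utilities
U_best  <- matrix(rep(true_u, each = n_resp), n_resp, 17) + gumbel(n_resp * 17, mu_best)
U_worst <- matrix(rep(true_u, each = n_resp), n_resp, 17) - gumbel(n_resp * 17, mu_worst)

# For a task showing items 1-4, "best" is the maximum noisy best-utility and
# "worst" the minimum noisy worst-utility among the items shown
items      <- 1:4
best_pick  <- apply(U_best[, items],  1, which.max)
worst_pick <- apply(U_worst[, items], 1, which.min)
table(best_pick)
table(worst_pick)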
In this paper, we estimate scale ratios by first estimating best and worst utilities independent of one another, and then regressing those utilities against one another, through the origin. The regression coefficient from this equation becomes the estimate of the scale ratio. Another method, suggested by Kevin Lattery (personal communication), is to estimate scale ratios as the ratio of standard deviations of best and worst utilities, again estimated independently (e.g., σ(best)/σ(worst)). One advantage of using this method is that it has reciprocal properties (i.e., σ(best)/σ(worst) = 1/(σ(worst)/σ(best))), which is not the case with the beta from a regression equation. Still another method involves estimating scale ratios via maximum likelihood, along with the betas, in an MNL model. We detail that method in a forthcoming paper (Rayner and Horne, forthcoming). All of these methods are capable of estimating scale ratios at the individual level, and all return similar results in terms of combined best-worst utilities when applied to the data sets used in this paper. For the sake of consistency alone, the below results all use the regression through the origin method. As in the simulated data, all estimation was at the individual level (HB), applying individual scale corrections using custom R code developed by one of the authors (package: "HBLogitR," forthcoming on CRAN). Best and worst utilities differed from one another in both data sets when estimated separately (Figure 2; Data set 1: best = 0.838 × worst, t[H0: best = worst] = 15.1; Data set 2: best = 0.802 × worst, t[H0: best = worst] = 32.3). In both data sets, worst utilities tended to be distributed across a wider range than best utilities, indicating less error around the former. Figure 3 shows combined best-worst utilities from both data sets, estimated with and without correcting for scale ratios. Employing the scale ratio correction made little difference in how utilities were estimated (Data set 1: r=0.998, t=1127.2; Data set 2: r=0.995, t=1349.8). Failure to adjust for scale ratios led to no more (or less) statistical bias than estimating two consecutive HB runs on the same data, not correcting for scale ratios both times. Nevertheless, there were still some small differences in the ranges of utilities and in hit rates as a result of correcting for scale ratios in estimating combined best-worst utilities. Failure to correct for scale ratios led to wider ranges of estimated utilities in both data sets (Table 2, top panel). Further, the difference on the low end of the range was about twice as large as the difference on the high end. Both of these findings are in keeping with what was found earlier in simulations when the best/worst scale ratio assumed in analysis is smaller than the actual—or in this case, our estimation of the actual ratio.

Figure 2. Best and worst utilities, estimated separately; left panel: data set 1, n=5100 utilities; right panel: data set 2, n=18360 utilities. Fit lines are y=x and least squares (flatter slope). The slope of the least squares lines relative to the diagonal is indicative of a wider range among worst utilities.

Figure 3. Combined best-worst utilities, estimated with and without corrections for scale ratios among response types; left panel: data set 1; right panel: data set 2.

There was also some, potentially systematic, bias in in-sample hit rates as a result of not correcting for scale ratios. In-sample hit rates in this circumstance were defined as respondents actually choosing an item in a task that their estimated utilities suggest they would.
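The first two scale-ratio estimates described above are simple to compute once best and worst utilities have been estimated separately. The R sketch below applies both to a hypothetical pair of utility vectors; the inputs are made up, and in practice they would be one respondent's (or the pooled) separately estimated utilities.

set.seed(3)
u_worst <- seq(-2.6, 2.6, length.out = 17)        # hypothetical worst-side utilities
u_best  <- 0.8 * u_worst + rnorm(17, sd = 0.15)   # hypothetical best-side utilities, narrower range

# 1) Regression through the origin: the slope is the best/worst scale ratio
ratio_reg <- coef(lm(u_best ~ 0 + u_worst))[[1]]

# 2) Ratio of standard deviations (the reciprocal-friendly alternative)
ratio_sd <- sd(u_best) / sd(u_worst)

c(regression = ratio_reg, sd_ratio = ratio_sd)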
When corrected for scale ratios, overall hit rates in both data sets were slightly worse than when not corrected (Table 2, bottom panel). However, and perhaps more importantly, hit rates among worst choices improved slightly with correction, and hit rates among best choices worsened. It is possible that this resulted because the scale ratio correction usually involved an up-weight on worst choices and a down-weight on best ones. The worst choices in effect became more important in determining the combined estimated utilities, while the best choices became less important.

Table 2. Aggregated combined best-worst utilities, estimated with and without corrections for scale ratios among response types (top panel), and in-sample hit rates (bottom panel). The largest hit rate in each group is marked with an asterisk.

Utility summaries:
                               min      1st quartile   3rd quartile   max
Data set 1, with correction   -3.29        -1.05          +1.88      +2.45
Data set 1, without           -3.47        -1.12          +1.98      +2.57
Data set 2, with correction   -1.76        -0.27          +0.47      +1.09
Data set 2, without           -1.99        -0.30          +0.51      +1.19

In-sample hit rates:
                               Best tasks        Worst tasks       All tasks
Data set 1, with correction    2770 (92.3%)      2806 (93.5%)*     5576 (92.9%)
Data set 1, without            2797 (93.2%)*     2783 (92.8%)      5580 (93.0%)*
Data set 2, with correction    11009 (79.9%)     11249 (81.7%)*    22258 (80.8%)
Data set 2, without            11355 (82.5%)*    11076 (80.4%)     22431 (81.4%)*

Utility ranks showed very little difference whether corrected for scale ratios or not. To the extent that utilities were ranked differently in the two different estimation methods (which was rare), the differences were often only one or two rank positions. These differences were similar again to what we would find if we estimated two consecutive HB runs on the same data under the same rules. So, there does appear to be a small bias in estimating utilities without correcting for scale ratios; this bias further appears to occur in the direction we might expect from simulations. But, the bias is not large enough to change any business decisions. The extra analysis required does not seem to be worth the effort if the goal of doing so is to remove bias.

ANOTHER REASON TO ADJUST FOR SCALE RATIOS?

There may, however, be another reason that justifies the extra analysis: cleaning data of particularly "noisy" respondents. Scale ratios can of course be estimated on an individual respondent basis. If a respondent's best/worst scale ratio approaches zero, or infinity, or is negative, it is possible that that person has misunderstood the task (e.g., is selecting "next best" instead of "worst") or is otherwise providing "noisy" data. This person's combined best-worst utilities may be difficult to estimate. This is apparent from Figure 4. There is a small group of respondents who have worst/best scale ratios that are near zero or negative. The individual hit rates for these respondents are particularly small when combined utilities are estimated with a correction for the scale ratio.

Figure 4. Hit rates and individual scale ratios (from data set 2). Those respondents with individual worst/best scale ratios > 0.4 (n=863) have average hit rates of 82.7%; those with individual worst/best scale ratios < 0.4 (n=55) have average hit rates of 51.8%.

Figure 5. Combined best-worst utilities, estimated with and without corrections for scale ratios among response types; left panel: all respondents, n=918; right panel: respondents with worst/best scale ratios > 0.4 (n=863).
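Figure 4 suggests a simple screening rule based on the individual scale ratios. The R sketch below only illustrates the idea: the ratio vector is simulated to mimic the pattern in Figure 4, and the 0.4 cut-off is simply the threshold discussed in the text, not a recommended constant.

set.seed(7)
# Hypothetical individual worst/best scale ratios: most respondents near 0.9,
# with a small "noisy" group near zero
ratio <- c(rnorm(863, mean = 0.9, sd = 0.25), rnorm(55, mean = 0.05, sd = 0.15))

keep <- ratio > 0.4   # flag respondents with small or negative ratios
table(keep)           # respondents retained vs. flagged as "noisy"

# Flagged respondents could be removed before re-estimating combined utilities,
# or reviewed alongside other diagnostics such as RLH.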
These small hit rates result because the combined utilities for this group of respondents are virtually inestimable when correcting for best/worst scale ratios. This can be seen in Figure 5. Utilities for those respondents with small or negative worst/best scale ratios tend to be estimated around zero when the correction is made, and add little but noise to the overall estimates (Figure 5, left panel). Removing those respondents and re-estimating utilities cleaned things up considerably (Figure 5, right panel). The same effect can be seen in the range of estimated utilities. Removing the "noisy" respondents led to increased utility ranges, as we might expect to be the case (Table 3).

Table 3. Combined best-worst utilities, estimated with scale ratio correction for all respondents in data set 2 (n=918), and for respondents with individual worst/best scale ratios > 0.4 (n=863).

Data set 2                      min      1st quartile   3rd quartile   max
All respondents                -1.76        -0.27          +0.47      +1.09
Small scale ratios removed     -2.05        -0.30          +0.54      +1.25

The individual root likelihood (RLH) statistic has been suggested as one criterion for cleaning "noisy" respondents from data (Orme, 2013). We compared individual worst/best scale ratios to individual RLH statistics in one of the data sets and found no relationship (r=-0.01, t[H0: r=0]=-0.185, p=0.853). It seems then, at least from examination of a single data set, that using best/worst scale ratios provides a measure of respondent "noisiness" that can be used independently from (and in concert with) RLH.

FINAL THOUGHTS

Best and worst responses from MaxDiff tasks do tend to be scaled differently. Whatever the reason for this, not taking those differences into account when estimating combined best-worst utilities appears to add some systematic bias, albeit a small amount, to those utilities. The presence of this bias raises the question of whether or not to adjust for the scale differences during analysis. Adjustment requires additional effort, and that effort may not be seen as justified given the lack of strong findings here and elsewhere. Practitioners, rightly or not, may feel that if there is only a little bias and that bias will not change managerial decisions about the data, why not continue with the status quo? It is a reasonable opinion to hold. But, fundamentally, this bias results from having two kinds of responses—best and worst—combined into a single analysis. Perhaps the better solution is to tailor the response used to the desired objective. If the objective is to identify "best" alternatives, ask for only best responses; likewise, if the objective is to identify "worst" alternatives, ask for only worst responses. Doing so might better focus respondents' attention on the business objective; we could include more choice tasks since we would be asking for one fewer response in each task; and we would not have to worry about the effect of different scales between response types. The difficulty with this approach for practitioners would be in identifying, managerially, what the objective should be (finding "bests" or eliminating "worsts") in an age where we've become used to doing both in the same exercise. This idea remains the focus of future research.

Jack Horne

REFERENCES

Ben-Akiva, M. and Lerman, S. R. (1985). Discrete Choice Analysis: Theory and Application to Travel Demand. MIT Press, Cambridge, MA.
Cohen, S. and Orme, B. (2004). What's your preference? Marketing Research, 16, pp. 32–37.
Dyachenko, T. L., Naylor, R. W. and Allenby, G. M. (2013a). Models of sequential evaluation in best-worst choice tasks. Advanced Research Techniques (ART) Forum. Chicago. http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2072496
Dyachenko, T. L., Naylor, R. W., and Allenby, G. M. (2013b). Ballad of the best and worst. 2013 Sawtooth Software Conference Proceedings. Dana Point, CA.
Louviere, J. J. (1991). Best-worst scaling: A model for the largest difference judgments. Working Paper. University of Alberta.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior, in P. Zarembka (ed.), Frontiers in Econometrics. Academic Press. New York. pp. 105–142.
Orme, B. (2013). MaxDiff/Web v.8 Technical Paper. http://www.sawtoothsoftware.com/education/techpap.shtml
Rayner, B. K. and Horne, J. (forthcoming). Scaled MaxDiff. Marketing Letters (submitted).
Swait, J. and Louviere, J. J. (1993). The role of the scale parameter in estimation and comparison of multinomial logit models. Journal of Marketing Research, 30, pp. 305–314.
Train, K. E. (2007). Discrete Choice Methods with Simulation. Cambridge University Press. New York. pp. 38–79.

USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF PRODUCT FEATURES

GREG ALLENBY, OHIO STATE UNIVERSITY
JEFF BRAZELL, THE MODELLERS
JOHN HOWELL, PENN STATE UNIVERSITY
PETER ROSSI, UNIVERSITY OF CALIFORNIA LOS ANGELES

ABSTRACT

In this paper we propose an approach for using conjoint analysis to attach economic value to specific features, a task that arises frequently in econometric applications and in intellectual property litigation. A common approach to this task involves taking the difference in utility levels and dividing by the price coefficient. This is fraught with difficulties, including (a) certain respondents being projected to pay astronomically high amounts for features, and (b) the fact that the approach ignores important competitive realities in the marketplace. In this paper we argue that assessing the economic value of a feature to a firm requires conducting market simulations (a share of preference analysis) involving a realistic set of competitors, including the outside good (the "None" category). Furthermore, it requires a game theoretic approach to compare the industry equilibrium prices with and without the focal product feature.

1. INTRODUCTION

Valuation of product features is a critical part of the development and marketing of products and services. Firms are continuously involved in the improvement of existing products by adding new features, and many "new products" are essentially old products which have been enhanced with features previously unavailable. For example, consider the smartphone category of products. As new generations of smartphones are produced and marketed, existing features such as screen resolution/size or cellular network speed are enhanced to new higher levels. In addition, features are added to enhance the usability of the smartphone. These new features might include integration of social networking functions into the camera application of the smartphone. A classic example, which was involved in litigation between Apple and Samsung, is the use of icons with rounded edges. New and enhanced features often involve substantial development costs and sometimes also require new components which drive up the marginal cost of production. The decision to develop new features is a strategic decision involving not only the cost of adding the feature but also the possible competitive response.
The development and marketing costs of feature enhancement must be weighed against the expected increase in profits which will accrue if the product feature is added or enhanced. Expected profits in a world with the new feature must be compared to expected profits in a world without the feature. Computing this change in expected profits involves predicting not only demand for the feature but also assessing the new industry equilibrium that will prevail with a new set of products and competitive offerings. In a litigation context, product features are often at the core of patent disputes. In this paper, we will not consider the legal questions of whether or not the patent is valid and whether or not the defendant has infringed the patent(s) in dispute. We will focus on the economic value of the features enabled by the patent. The market value of the patent is determined both by the value of the features enabled and by the probability that the patent will be deemed to be a valid patent and the costs of defending the patent's validity and enforcement. The practical content of both apparatus and method patents can be viewed as the enabling of product features. The potential value of the product feature(s) enabled by a patent is what gives the patent value. That is, patents are valuable only to the extent that they enable product features not obtainable via other (so-called "non-infringing") means. In both commercial and litigation realms, therefore, valuation of product features is critical to decision making and damages analysis. Conjoint Analysis (see, for example, Orme 2009 and Gustafsson et al. 2000) is designed to measure and simulate demand in situations where products can be assumed to be composed of bundles of features. While conjoint analysis has been used for many years in product design (see the classic example in Green and Wind, 1989), the use of conjoint in patent litigation has only developed recently. Both uses of conjoint stem from the need to predict demand in the future (after the new product has been released) or in a counterfactual world in which the accused infringing products are withdrawn from the market. However, the literature has struggled, thus far, to precisely define the meaning of "value" as applied to product features. The current practice is to compute what many authors call a Willingness To Pay (hereafter, WTP) or a Willingness To Buy (hereafter, WTB). WTP for a product feature enhancement is defined as the monetary amount which would be sufficient to compensate a consumer for the loss of the product feature or for a reduction to the non-enhanced state. WTB is defined as the change in sales or market share that would occur as the feature is added or enhanced. The problem with both the WTP and WTB measures is that they are not equilibrium outcomes. WTP measures only a shift in the demand curve and not what the change in equilibrium price will be as the feature is added or enhanced. WTB holds prices fixed and does not account for the fact that as a product becomes more valuable, equilibrium prices will typically go up. We advocate using equilibrium outcomes (both price and shares) to determine the incremental economic profits that would accrue to a firm as a product is enhanced. In general, the WTP measure will overstate the change in equilibrium price and profits and the WTB measure will overstate the change in equilibrium market share.
We illustrate this using a conjoint survey for digital cameras and the addition of a swivel screen display as the object of the valuation exercise. Standard WTP measures are shown to greatly overstate the value of the product feature. To compute equilibrium outcomes, we will have to make assumptions about cost and the nature of competition and the set of competitive offers. Conjoint studies will have to be designed with this in mind. In particular, greater care must be exercised to include an appropriate set of competitive brands, to handle the outside option appropriately, and to estimate price sensitivity precisely.

2. PSEUDO-WTP, TRUE WTP AND WTB

In the context of conjoint studies, feature valuation is achieved by using various measures that relate only to the demand for the products and features and not to the supply. In particular, it is common to produce estimates of what some call Willingness To Pay and Willingness To Buy. Both WTP and WTB depend only on the parameters of the demand system. As such, the WTP and WTB measures cannot be measures of the market value of a product feature as they do not directly relate to what incremental profits a firm can earn on the basis of the product feature. In this section, we review the WTP and WTB measures and explain the likely biases in these measures in feature valuation. We also explain why the WTP measures used in practice are not true WTP measures and provide the correct definition of WTP.

2.1 The Standard Choice Model for Differentiated Product Demand

Valuation of product features depends on a model for product demand. In most marketing and litigation contexts, a model of demand for differentiated products is appropriate. We briefly review the standard choice model for differentiated product demand. In many contexts, any one customer purchases at most one unit of the product. While it is straightforward to extend our framework to consider products with variable quantity purchases, we limit attention to the unit demand situation and develop our model for a single respondent. Extensions needed for multiple respondents are straightforward (see Rossi et al., 2005). The demand system then becomes a choice problem in which customers have J choice alternatives, each with characteristics vector, xj, and price, pj. The standard random utility model (McFadden, 1981) postulates that the utility for the jth alternative consists of a deterministic portion (driven by x and p) and an unobservable portion which is modeled, for convenience, as a Type I extreme value distribution: Uj = xj'β + βp·pj + εj. Here xj is a k × 1 vector of attributes of the product, including the feature that requires valuation. xf denotes the focal feature. Feature enhancement is modeled as alternative levels of the focal feature, xf (one element of the vector x), while the addition of features would simply have xf as a dummy or indicator variable. Three assumptions regarding the model above are important for feature valuation:
1. This is a compensatory model with a linear utility.
2. We enter price linearly into the model instead of using the more common dummy variable coding used in the conjoint literature. That is, if price takes on K values, p1, ..., pK, we include one price coefficient instead of the usual K-1 dummy variables to represent the different levels.
In equilibrium calculations, we will want to consider prices at any value in some relevant range in order to use first order conditions which assume a continuum.
3. There is a random utility error that theoretically can take on any number on the real line.
The random utility error, εj, represents the unobservable (to the investigator) part of utility. This means that actual utility received from any given choice alternative depends not only on the observed product attributes, x, and price but also on realizations from the error distribution. In the standard random utility model, there is the possibility of receiving up to infinite utility from the choice alternative. This means that in evaluating the option to make choices from a set of products, we must consider the contribution not only of the observed or deterministic portion of utility but also the distribution of the utility errors. The possibilities for realization from the error distribution provide a source of utility for each choice alternative. In the conjoint literature, the β coefficients are called part-worths. It should be noted that the part-worths are expressed in a utility scale which has an arbitrary origin (as defined by the base alternative) and an equally arbitrary scaling (somewhat like the temperature scale). This means that we cannot compare elements of the β vector in ratio terms or utilizing percentages. In addition, if different consumers have different utility functions (which is almost a truism of marketing) then we cannot compare part-worths across individuals. For example, suppose that one respondent gets twice as much utility from feature A as feature B, while another respondent gets three times as much utility from feature B as A. All we can say is that the first respondent ranks A over B and the second ranks B over A; no statements can be made regarding the relative "liking" of the various features.

2.2 Pseudo WTP

The arbitrary scaling of the logit choice parameters presents a challenge to interpretation. For this reason, there has been a lot of interest in various ways to convert part-worths into quantities such as market share or dollars which are defined on ratio scales. What is called "WTP" in the conjoint literature is one attempt to convert the part-worth of the focal feature, βf, to the dollar scale. Using a standard dummy variable coding, we can view the part-worth of the feature as representing the increase in deterministic utility that occurs when the feature is turned on. For feature enhancement, a dummy coding approach would require that we use the difference in part-worths associated with the enhancement in the "WTP" calculation. If the feature part-worth is divided by the price coefficient, then we have converted to the ratio dollar scale. We will call this "pseudo-WTP" as it is not a true WTP measure, as we explain below. This p-WTP measure is often justified by appeal to the simple argument that this is the amount by which price could be raised and still leave the "utility" for choice alternative j the same when the product feature is turned on. Others define this as a "willingness to accept" by giving the completely symmetric definition as the amount by which price would have to be lowered to yield the same utility in a product with the feature turned off as with a product with the feature turned on. Given the assumption of a linear utility model and a linear price term, both definitions are identical.
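In symbols, the pseudo-WTP just described is, up to sign, the feature part-worth divided by the price coefficient, p-WTP = βf/(-βp). A one-line R illustration with made-up coefficients (these are not values from the study reported later in this paper):

beta_feature <- 0.9      # hypothetical part-worth of turning the feature on
beta_price   <- -0.015   # hypothetical linear price coefficient (utility per dollar)

p_wtp <- beta_feature / (-beta_price)   # dollars; here 60
p_wtp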
In practice, reference price effects often make WTA differ from WTP (see Viscusi and Huber, 2011), but, in the standard economic model, these are equivalent. In the literature (Orme, 2001), p-WTP is sometimes defined as the amount by which the price of the feature-enhanced product can be increased and still leave its market share unchanged. In a homogeneous logit model, this is identical to the expression above. Inspection of the p-WTP formula reveals at least two reasons why the p-WTP formula cannot be true WTP. First, the change in WTP should depend on which product is being augmented with the feature. The conventional p-WTP formula is independent of which product variant is being augmented due to the additivity of the deterministic portion of the utility function. Second, true WTP must be derived ex ante—before a product is chosen. That is, adding the feature to one of the J products in the marketplace enhances the possibilities for attaining high utility. Removing the feature reduces levels of utility by diminishing the opportunities in the choice set. This is all related to the assumption that on each choice occasion a separate set of choice errors is drawn. Thus, the actual realization of the random utility errors is not known prior to the choice, and must be factored into the calculations to estimate the true WTP.

2.3 True WTP

WTP is an economic measure of social welfare derived from the principle of compensating variation. That is, WTP for a product is the amount of income that will compensate for the loss of utility obtained from the product; in other words, a consumer should be indifferent between having the product or not having the product with an additional income equal to the WTP. Indifference means the same level of utility. For choice sets, we must consider the amount of income (called the compensating variation) that I must pay a consumer faced with a diminished choice set (either an alternative is missing or diminished by omission of a feature) so that the consumer attains the same level of utility as a consumer facing a better choice set (with the alternative restored or with the feature added). Consumers evaluate choices a priori or before choices are made. Features are valuable to the extent that they enhance the attainable utility of choice. Consumers do not know the realization of the random utility errors until they are confronted with a choice task. Addition of the feature shifts the deterministic portion of utility or the mean of the random utility. Variation around the mean due to the random utility errors is equally important as a source of value. The random utility model was designed for application to revealed preference or actual choice in the marketplace. The random errors are thought to represent information unobservable to the researcher. This unobservable information could be omitted characteristics that make particular alternatives more attractive than others. In a time series context, the omitted variables could be inventory which affects the marginal utility of consumption. In a conjoint survey exercise, respondents are explicitly asked to make choices solely on the basis of the attributes and levels presented and to assume that all other, omitted characteristics are the same. It might be argued, then, that the role of random utility errors is different in the conjoint context. Random utility errors might be more the result of measurement error rather than omitted variables that influence the marginal utility of each alternative.
However, even in the conjoint setting, we believe it is still possible to interpret the random utility errors as representing a source of unobservable utility. For example, conjoint studies often include brand names as attributes. In these situations, respondents may infer that other characteristics correlated with the brand name are present even though the survey instructions tell them not to make these attributions. One can also interpret the random utility errors as arising from functional form mis-specification. That is, we know that the assumption of a linear utility model (no curvature and no interactions between attributes) is a simplification at best. We can also take the point of view that a consumer is evaluating a choice set prior to the realization of the random utility errors that occur during the purchase period. For example, I consider the value of choice in the smartphone category at some point prior to a purchase decision. At that point, I know the distribution of random utility errors that will depend on features I have not yet discovered or on demand for features which is not yet realized (e.g., I will realize that I will get a great deal of benefit from a better browser). When I go to purchase a smartphone, I will know the realization of these random utility errors. To evaluate the utility afforded by a choice set, we must consider the distribution of the maximum utility obtained across all choice alternatives. This maximum has a distribution because of the random utility errors. For example, suppose we add the feature to a product configuration that is far from utility maximizing. It may still be that, even with the feature, the maximum deterministic utility is provided by a choice alternative without the feature. This does not mean that the feature has no value simply because the product it is being added to is dominated by other alternatives in terms of deterministic utility. The alternative with the feature added can be chosen after realization of the random utility errors if the realization of the random utility error is very high for the alternative that is enhanced by addition of the feature. The evaluation of true WTP involves the change in the expected maximum utility for a set of offerings with and without the enhanced product feature. We refer the reader to the more technical paper by Allenby et al. (2013) for its derivation, and simply show the formula below to illustrate its difference from the p-WTP formula described above:

WTP = [E(maxj Uj | xf = a*) - E(maxj Uj | xf = a)] / (-βp),

where a* is the enhanced level of the attribute and a is the base level. In this formulation, the value of an enhanced level of an attribute is greater when the choice alternative has higher initial value.

2.4 WTB

In some analyses, product features are valued using a "Willingness To Buy" concept. WTB is the change in market share that will occur if the feature is added to a specific product:

WTB = MS(j | xf = a*) - MS(j | xf = a),

where MS(j) is the market share equation for product j. The market share depends on the entire price vector and the configuration of the choice set. This equation holds prices fixed as the feature is enhanced or added. The market share equations are obtained by summing up the logit probabilities over possibly heterogeneous (in terms of taste parameters) customers. The WTB measure does depend on which product the feature is added to (even in a world with identical or homogeneous customers) and, thereby, remedies one of the defects of the pseudo-WTP measure.
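To make the contrast concrete, the R sketch below computes the three demand-side quantities for a homogeneous logit with made-up utilities; the paper's heterogeneous model integrates over respondents, so this is only an illustration. It computes pseudo-WTP from the part-worths, a true-WTP-style measure from the change in the expected maximum utility (the log-sum), and WTB as the change in the enhanced product's share at fixed prices.

beta_price   <- -0.015
beta_feature <- 0.9
V <- c(A = 1.2, B = 0.8, C = 0.5, none = 0)   # deterministic utilities at current prices, feature off

share <- function(v) exp(v) / sum(exp(v))     # homogeneous logit shares

V_star <- V
V_star["B"] <- V["B"] + beta_feature          # add the feature to product B

# pseudo-WTP: independent of which product is enhanced
p_wtp <- beta_feature / (-beta_price)

# true-WTP-style measure: change in expected maximum utility (log-sum), in dollars
true_wtp <- (log(sum(exp(V_star))) - log(sum(exp(V)))) / (-beta_price)

# WTB: change in product B's share, holding all prices fixed
wtb <- share(V_star)["B"] - share(V)["B"]

c(p_wtp = p_wtp, true_wtp = true_wtp, WTB_for_B = unname(wtb))

With these illustrative numbers, the log-sum-based value is well below the pseudo-WTP, mirroring the ordering reported in the illustration later in the paper.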
However, WTB assumes that firms will not alter prices in response to a change in the set of products in the marketplace as the feature is added or enhanced. In most competitive situations, if a firm enhances its product and the other competing products remain unchanged, we would expect the focal firm to be able to command a somewhat higher price, while the other firms' offerings would decline in demand; therefore, the competing firms would reduce their prices or add other features.

2.5 Why p-WTP, WTP and WTB are Inadequate

Pseudo-WTP, WTP and WTB do not take into account equilibrium adjustments in the market as one of the products is enhanced by addition of a feature. For this reason, we cannot view either pseudo-WTP or WTP as what a firm can charge for a feature-enhanced product, nor can we view WTB as the market share that can be gained by feature enhancement. Computation of changes in the market equilibrium due to feature enhancement of one product will be required to develop a measure of the economic value of the feature. WTP will overstate the price premium afforded by feature enhancement and WTB will also overstate the impact of feature enhancement on market share. Equilibrium computations in differentiated product cases are difficult to illustrate by simple graphical means. In this section, we will use the standard demand and supply graphs to provide an informal intuition as to why p-WTP and WTB will tend to overstate the benefits of feature enhancement. Figure 1 shows a standard industry supply and demand set-up. The demand curve is represented by the blue downward sloping lines. "D" denotes demand without the feature and "D*" denotes demand with the feature. The vertical difference between the two demand curves is the change in WTP as the feature is added. We assume that addition of the feature may increase the marginal cost of production (note: for some features such as those created purely via software, the marginal cost will not change). It is easy to see that, in this case, the change in WTP exceeds the change in equilibrium price. A similar argument can be made to illustrate that WTB will exceed the actual change in demand in a competitive market.

Figure 1 Difficulties with WTP

3. ECONOMIC VALUATION OF FEATURES

The goal of feature enhancement is to improve the profitability of the firm introducing the feature-enhanced product into an existing market. Similarly, the value of a patent is ultimately derived from the profits that accrue to firms who practice the patent by developing products that utilize the patented technology. In fact, the standard economic argument for allowing patent holders to sell their patents is that, in this way, patents will eventually find their way into the hands of those firms who can best utilize the technology to maximize demand and profits. For these reasons, we believe that the appropriate measure of the economic value of feature enhancement is the incremental profits that the feature enhancement will generate. Profit, π, is associated with the industry equilibrium prices and shares given a particular set of competing products, which is represented by the choice set defined by the attribute matrix. This definition allows for both price and share adjustment as a result of feature enhancement, removing some of the objections to the p-WTP, WTP and WTB concepts.
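As a rough sketch of what such an equilibrium comparison involves, the R code below computes static Nash prices for single-product firms facing homogeneous logit demand with an outside good and constant marginal costs. This is a deliberately simplified version of the setting (assumptions along the lines of those listed in the next subsection); the paper's own computations use a heterogeneous logit, and all utilities, costs, and the size of the feature effect here are made up.

alpha <- 0.015                             # price sensitivity (utility per dollar), assumed common
delta <- c(A = 1.2, B = 0.8, C = 0.5)      # non-price utility of each competing product
cost  <- c(A = 60,  B = 55,  C = 50)       # constant marginal costs

shares <- function(p, d) {
  v <- d - alpha * p
  exp(v) / (1 + sum(exp(v)))               # the "1 +" term is the outside good
}

# Single-product logit Bertrand-Nash first-order condition:
#   p_j = c_j + 1 / (alpha * (1 - s_j)),
# solved here by simple fixed-point iteration
nash_prices <- function(d, c0, iters = 500) {
  p <- c0 + 50
  for (i in seq_len(iters)) p <- c0 + 1 / (alpha * (1 - shares(p, d)))
  p
}

profit <- function(p, d, c0) (p - c0) * shares(p, d)   # profit per unit of market size

delta_star <- delta
delta_star["B"] <- delta["B"] + 0.9        # firm B's product gets the feature

p0 <- nash_prices(delta, cost);      pi0 <- profit(p0, delta, cost)
p1 <- nash_prices(delta_star, cost); pi1 <- profit(p1, delta_star, cost)

pi1["B"] - pi0["B"]                        # incremental equilibrium profit from the feature

With these made-up inputs, the equilibrium price increase for product B (p1["B"] - p0["B"]) comes out well below the $60 pseudo-WTP implied by the same coefficients, which is the pattern Figure 1 is meant to convey.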
Incremental profits are closer in spirit, though not identical, to the definition of true WTP in the sense that profits depend on the entire choice set and the incremental profits may depend on which product is subject to feature enhancement. However, social WTP does not include cost considerations and does not address how the social surplus is divided between the firm and the customers. In the abstract, our definition of the economic value of feature enhancement seems to be the appropriate measure for the firm that seeks to enhance a feature. All funds have an opportunity cost, and the incremental profits calculation is fundamental to deploying product development resources optimally. In fairness, industry practitioners of conjoint analysis also appreciate some of the benefits of an incremental profits orientation. Often marketing research firms construct "market simulators" that simulate market shares given a specific set of products in the market. Some even go so far as to attempt to compute the "optimal" price by simulating different market shares corresponding to different "pricing scenarios." In these exercises, practitioners fix competing prices at a set of prices that may include their informal estimate of competitor response. This is not the same as computing a market equilibrium but moves in that direction.

3.1 Assumptions

Once the principle of incremental profits is adopted, the problem becomes one of defining the nature of competition and the competitive set, and choosing an equilibrium concept. These assumptions must be added to the assumptions of a specific parametric demand system (we will use a heterogeneous logit demand system which is flexible but still parametric) as well as a linear utility function over attributes and the assumption (implicit in all conjoint analysis) that products can be well described by bundles of attributes. Added to these assumptions, our valuation method will also require cost information. Specifically, we will assume:
1. Demand Specification: A standard heterogeneous logit demand that is linear in the attributes (including price).
2. Cost Specification: Constant marginal cost.
3. Single-product firms.
4. Feature Exclusivity: The feature can only be added to one product.
5. No Exit: Firms cannot exit or enter the market after product enhancement takes place.
6. Static Nash Price Competition: There is a set of prices from which each individual firm would be worse off if it deviated from the equilibrium.
Assumptions 2, 3, and 4 can be easily relaxed. Assumption 1 can be replaced by any valid demand system. Assumptions 5 and 6 cannot be relaxed without imparting considerable complexity to the equilibrium computations.

4. USING CONJOINT ANALYSIS FOR EQUILIBRIUM CALCULATIONS

Economic valuation of feature enhancement requires a valid and realistic demand system as well as cost information and assumptions about the set of competitive products. If conjoint studies are to be used to calibrate the demand system, then particular care must be taken to design a realistic conjoint exercise. The low cost of fielding and analyzing a conjoint design makes this method particularly appealing in a litigation context. In addition, with Internet panels, conjoint studies can be fielded and analyzed in a matter of days, a time frame also attractive in the tight schedules of patent litigation. However, there is no substitute for careful conjoint design. Many designs fielded today are not useful for economic valuation of feature enhancement.
For example, in recent litigation, conjoint studies have been used in which there is no outside option, only one brand, and only patented features. A study with any of these limitations is of questionable value for true economic valuation. Careful practitioners of conjoint have long been aware that conjoint is appealing because of its simplicity and low cost but that careful studies make all the difference between realistic predictions of demand and useless results. We will not repeat the many prescriptions for careful survey analysis, which include careful crafting of questionnaires with terminology that is meaningful to respondents, thorough and documented pre-testing, and representative (projectable) samples. Furthermore, many of the prescriptions for conjoint design, including well-specified and meaningful attributes and levels, are extremely important. Instead, we will focus on the areas we feel are especially important for economic valuation and are not considered carefully enough.

4.1 Set of Competing Products

The guiding principle in conjoint design for economic valuation of feature enhancement is that the conjoint survey must closely approximate the marketplace confronting consumers. In industry applications, the feature enhancement has typically not yet been introduced into the marketplace (hence the appeal of a conjoint study), while in patent litigation the survey is being used to approximate demand conditions at some point in the past in which patent infringement is alleged to have occurred. Most practitioners of conjoint are aware that, for realistic market simulations, the major competing products must be used. This means that the product attributes in the study should include not only functional attributes such as screen size, memory, etc., but also the major brands. This point is articulated well in Orme (2001). However, in many litigation contexts, the view is that only the products and brands accused of patent infringement should be included in the study. The idea is that only a certain brand's products are accused of infringement and, therefore, that the only relevant feature enhancements for the purposes of computing patent damages are feature enhancements in the accused products. For example, in recent litigation, Samsung has accused Apple iOS devices of infringing certain patents owned by Samsung. The view of the litigators is that a certain feature (for example, a certain type of video capture and transmission) infringes a Samsung patent. Therefore, the only relevant feature enhancement is to consider the addition or deletion of this feature on iOS devices such as the iPhone, iPad and iPod touch. This is correct but only in a narrow sense. The hypothetical situation relevant to damages in that case is only the addition of the feature to relevant Apple products. However, the economic value of that enhancement depends on the other competing products in the marketplace. Thus, a conjoint survey which only uses Apple products in developing conjoint profiles cannot be used for economic valuation. The value of a feature in the marketplace is determined by the set of alternative products. For example, in a highly competitive product category with many highly substitutable products, the economic value or incremental profits that could accrue to any one competitor would typically be very small.
However, in an isolated part of the product space (that is, a part of the attribute space that is not densely filled in with competing products), a firm may capture more of the value to consumers of a feature enhancement. For example, if a certain feature is added to an Android device, this may cause greater harm to Samsung in terms of lost sales/profits because smart devices in the Android market segment (of which Samsung is a part) are more inter-substitutable. It is possible that addition of the same feature to the iOS segment may be more valuable as Apple iOS products may be viewed as less substitutable with Android products than other Android products. We emphasize that these examples are simply conjectures to illustrate the point that a full set of competing products must be used in the conjoint study. We do not think it necessary to have all possible product variants or competitors in the conjoint study and subsequent equilibrium computations. In many product categories, this would require a massive set of possible products with many features. Our view is that it is important to design the study to consider the major competing products both in terms of brands and the attributes used in the conjoint design. It is not required that the conjoint study exactly mirror the complete set of products and brands that are in the marketplace, but the main exemplars of competing brands and product positions must be included.

4.2 Outside Option

There is considerable debate as to the merits of including an outside option in conjoint studies. Many practitioners use a "forced-choice" conjoint design in which respondents are forced to choose one from the set of product profiles in each conjoint choice task. The view is that "forced-choice" will elicit more information from the respondents about the tradeoffs between product attributes. If the "outside" or "none of the above" option is included, advocates of forced choice argue that respondents may shy away from the cognitively more demanding task of assessing tradeoffs and select the "none" option to reduce cognitive effort. On the opposite side, other practitioners advocate inclusion of the outside option in order to assess whether or not the product profiles used in the conjoint study are realistic in the sense of attracting considerable demand. The idea is that if respondents select the "none of the above" option too frequently, then the conjoint design has offered very unattractive hypothetical products. Still others (see, for example, Brazell et al., 2006) argue the opposite side of the argument for forced choice. They argue that there is a "demand" effect in which respondents select at least one product to "please" the investigator. There is also a large literature on how to implement the "outside" option. Whether or not the outside option is included depends on the ultimate use of the conjoint study. Clearly, it is possible to measure how respondents trade off different product attributes against each other without inclusion of the outside option. For example, it is possible to estimate the price coefficient in a conjoint study which does not include the outside option. Under the assumption that all respondents are not budget constrained, the price coefficient should theoretically measure the trade-offs between other attributes and price. The fact that respondents might select a lower price and pass on some features means that they have an implicit valuation of the dollar savings involved in this trade-off.
If all respondents are standard economic agents in the sense that they engage in constrained utility maximization, then this valuation of dollar savings is a valid estimate of the marginal utility of income. This means that a conjoint study without the outside option can be used to compute the p-WTP measure, which only requires a valid price coefficient. We have argued that p-WTP is not a measure of the economic value to the firm of feature enhancement. Economic valuation requires a complete demand system (including the outside good) as well as the competitive and cost conditions. In order to compute valid equilibrium prices, we need to explicitly consider substitution from and to other goods, including the outside good. For example, suppose we enhance a product with a very valuable new feature. We would expect to capture sales from other products in the category as well as to expand the category sales; the introduction of the Apple iPad dramatically grew the tablet category due, in part, to the features incorporated in the iPad. Chintagunta and Nair (2011) make a related observation that price elasticities will be biased if the outside option is not included. We conclude that an outside option is essential for economic valuation of feature enhancement, as the only way to incorporate substitution in and out of the category is by the addition of the outside option. At this point, it is possible to take the view that if respondents are pure economic actors, they should select the outside option corresponding to their true preferences and their choices will properly reflect the marginal utility of income. However, there is a growing literature which suggests that different ways of expressing or allowing for the outside option will change the frequency with which it is selected. In particular, the so-called "dual response" way of allowing for the outside option (see Uldry et al., 2002 and Brazell et al., 2006) has been found to increase the frequency of selection of the outside option. The "dual-response" method asks the respondent first to indicate which of the product profiles (without the outside option) is most preferred, and then asks whether the respondent would actually buy the product at the price posted in the conjoint design. Our own experience confirms that this mode of including the outside option greatly increases the selection of the outside option. Our experience has also been that the traditional method of including the outside option often elicits a very low rate of selection, which we view as unrealistic. The advocates of the "dual response" method argue that the method helps to reduce a conjoint survey bias toward higher purchase rates than in the actual marketplace. Another way of reducing bias toward higher purchase rates is to design a conjoint using an "incentive-compatible" scheme in which the conjoint responses have real monetary consequences. There are a number of ways to do this (see, for example, Ding et al., 2005), but most suggestions (an interesting exception is Dong et al., 2010) use some sort of actual product and a monetary allotment. If the products in the study are actual products in the marketplace, then the respondent might actually receive the product chosen (or, perhaps, be eligible for a lottery which would award the product with some probability). If the respondent selects the outside option, they would receive a cash transfer (or equivalent lottery eligibility).
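As a small illustration of how dual-response answers can be folded into a single choice-with-none record per task (a generic coding sketch, not the authors' implementation), the first answer picks among the profiles shown and the second answer decides whether that profile or the outside option is recorded as the choice.

code_dual_response <- function(first_choice, would_buy, n_profiles = 4) {
  # first_choice: index of the preferred profile (1..n_profiles)
  # would_buy: TRUE/FALSE answer to "would you buy it at the stated price?"
  # returns the chosen alternative, with n_profiles + 1 standing for "none of the above"
  ifelse(would_buy, first_choice, n_profiles + 1)
}

code_dual_response(first_choice = c(2, 4, 1), would_buy = c(TRUE, FALSE, TRUE))
# 2 5 1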
4.3 Estimating Price Sensitivity

Both WTP and equilibrium prices are sensitive to inferences regarding the price coefficient. If the distribution of price coefficients puts any mass at all on positive values, then there does not exist a finite equilibrium price. All firms will raise prices without bound, effectively "firing" all consumers with negative price sensitivity and making infinite profits on the segment with positive price sensitivity. Most investigators regard positive price coefficients as inconsistent with rational behavior. However, it will be very difficult for a normal model to drive the mass on the positive half-line for price sensitivity to a negligible quantity if there is mass near zero on the negative side.

We must distinguish uncertainty in posterior inference from irrational behavior. If a number of respondents have posteriors for price coefficients that put most mass on positive values, this suggests a design error in the conjoint study; perhaps respondents are using price as a proxy for the quality of omitted features and ignoring the "all other things equal" survey instructions. In this case, the conjoint data should be discarded and the study re-designed. On the other hand, considerable mass on positive values can arise simply because of the normality assumption and the fact that we have very little information about each respondent. In these situations, we have found it helpful to change the prior or random effect distribution to impose a sign constraint on the price coefficient.

In many conjoint studies, the goal is to simulate market shares for some set of products. Market shares can be relatively insensitive to the distribution of the price coefficients when prices are fixed to values typically encountered in the marketplace. It is only when one considers unusual relative prices, or prices that are relatively high or low, that the implications of a distribution of price sensitivity will be felt. By definition, price optimization will stress-test the conjoint exercise by considering prices outside the small range usually considered in market simulators. For this reason, the quality standards for design and analysis of conjoint data have to be much higher when the data are used for economic valuation than for many of the typical uses of conjoint. Unless the distribution of price sensitivity puts little mass near zero, the conjoint data will not be useful for economic valuation using either our equilibrium approach or the more traditional and flawed p-WTP methods.
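As a hedged sketch of the sign constraint discussed above, one common device is to model each individual's price coefficient as the negative exponential of a normally distributed random effect, so that it is strictly negative by construction (a log-normal distribution on the negative half-line). The hyperparameter values below are arbitrary, and the snippet illustrates only the reparameterization, not the full hierarchical estimation.

    import numpy as np

    rng = np.random.default_rng(0)

    # Unconstrained normal random effect across respondents (hyperparameters are illustrative).
    mu, tau = -4.0, 0.5
    theta = rng.normal(mu, tau, size=1000)

    # Sign-constrained individual price coefficients: strictly negative by construction.
    beta_price = -np.exp(theta)

    print("share of positive price coefficients:", (beta_price > 0).mean())  # 0.0 by construction
    print("median price coefficient:", round(float(np.median(beta_price)), 4))

In a hierarchical Bayes sampler, the same transformation would simply be applied to the respondent-level draws, so the normal machinery for the random effects is retained while the implied price coefficients never cross zero.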
5. ILLUSTRATION

To illustrate our proposed method for economic valuation and to contrast it with standard p-WTP methods, we consider the example of the digital camera market. We designed a conjoint survey to estimate the demand for features in the point and shoot submarket. We considered the following seven features with associated levels:

1. Brand: Canon, Sony, Nikon, Panasonic
2. Pixels: 10, 16 mega-pixels
3. Zoom: 4x, 10x optical
4. Video: HD (720p), Full HD (1080p) and mic
5. Swivel Screen: No, Yes
6. WiFi: No, Yes
7. Price: $79–279

We focused on evaluating the economic value of the swivel screen feature, which is illustrated in Figure 2. The conjoint design was a standard fractional factorial design in which each respondent viewed sixteen choice sets, each of which featured four hypothetical products. A dual response mode was used to incorporate the outside option. Respondents were first asked which of the four profiles presented in each choice task was most preferred. Then the respondent was asked if they would buy the preferred profile at the stated price. If not, the response is recorded as the "outside option" or "none of the above." Respondents were screened to only those who owned a point and shoot digital camera and who considered themselves to be a major contributor to the decision to purchase this camera.

Figure 2. Swivel Screen Attribute

Details of the study, its sampling frame, number of respondents and details of estimation are provided in Allenby et al. (2014). We focus here on some of the important summary findings:

1. The p-WTP measure of the swivel screen attribute is $63.
2. The WTP measure of the swivel screen attribute is $13.
3. The equilibrium change in profits is estimated to be $25.

We find that the p-WTP measure dramatically overstates the economic value of a product feature, and that the more economic-based measures are more reasonable.
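As a hedged illustration of the equilibrium logic behind the profit comparison (and not the actual computations in Allenby et al., 2014, which use the full heterogeneous demand system), the sketch below computes Nash equilibrium prices for single-product firms under a homogeneous logit demand with an outside good, using the standard fixed-point form of the first-order condition p_j = c_j + 1/(alpha*(1 - s_j)), and then compares the focal product's equilibrium profit with and without a feature enhancement. All utilities, costs, the price coefficient, and the market size are hypothetical.

    import numpy as np

    def logit_shares(v, p, alpha):
        """Homogeneous logit shares with an outside good of utility 0; alpha > 0 is price sensitivity."""
        expu = np.exp(v - alpha * p)
        return expu / (1.0 + expu.sum())

    def nash_prices(v, c, alpha, iters=1000, tol=1e-10):
        """Fixed-point iteration on the single-product-firm logit FOC: p_j = c_j + 1/(alpha*(1 - s_j))."""
        p = c + 1.0 / alpha
        for _ in range(iters):
            s = logit_shares(v, p, alpha)
            p_new = c + 1.0 / (alpha * (1.0 - s))
            if np.max(np.abs(p_new - p)) < tol:
                break
            p = p_new
        return p

    alpha = 0.02                              # hypothetical price sensitivity
    v = np.array([1.0, 0.8, 0.6, 0.5])        # hypothetical non-price utilities; product 0 is focal
    c = np.array([60.0, 55.0, 50.0, 45.0])    # hypothetical marginal costs
    market_size = 1_000_000

    def focal_equilibrium_profit(v):
        p = nash_prices(v, c, alpha)
        s = logit_shares(v, p, alpha)
        return (p[0] - c[0]) * s[0] * market_size

    baseline = focal_equilibrium_profit(v)
    v_enhanced = v.copy()
    v_enhanced[0] += 0.3                      # hypothetical feature worth 0.3 utils
    enhanced = focal_equilibrium_profit(v_enhanced)

    print(f"incremental equilibrium profit from the feature: ${enhanced - baseline:,.0f}")

Because the equilibrium prices are re-computed in both scenarios, this comparison reflects competitors' price adjustments to the enhanced product, which is exactly what the pseudo-WTP calculation ignores.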
6. CONCLUSION

Valuation of product features is an important part of the development and marketing of new products as well as the valuation of patents that are related to feature enhancement. We take the position that the most sensible measure of the economic value of a feature enhancement (either the addition of a completely new feature or the enhancement of an existing feature) is incremental profits. That is, we compare the equilibrium outcomes in a marketplace in which one of the products (corresponding to the focal firm) is feature enhanced with the equilibrium profits in the same marketplace but where the focal firm's product is not feature enhanced. This measure of economic value can be used to make decisions about the development of new features or to choose between a set of features that could be enhanced. In the patent litigation setting, the value of the patent as well as the damages that may have occurred due to patent infringement should be based on an incremental profits concept.

Conjoint studies can play a vital role in feature valuation provided that they are properly designed, analyzed, and supplemented by information on the competitive and cost structure of the marketplace in which the feature-enhanced product is introduced. Conjoint methods can be used to develop a demand system but require careful attention to the inclusion of the outside option and the inclusion of the relevant competing brands. Proper negativity constraints must be used to restrict the price coefficients to negative values. In addition, the Nash equilibrium prices computed on the basis of the conjoint-constructed demand system are sensitive to the precision of inference with respect to price sensitivity. This may mean larger and more informative samples than typically used in conjoint applications today.

We explain why the current practice of using a change in "WTP" as a way of valuing a feature is not a valid measure of economic value. In particular, the calculations done today involving dividing the part-worths by the price coefficient are not even proper measures of WTP. Current pseudo-WTP measures have a tendency to overstate the economic value of feature enhancement as they are only measures of shifts in demand and do not take into account the competitive response to the feature enhancement. In general, firms competing against the focal feature-enhanced product will adjust their prices downward in response to the more formidable competition afforded by the feature-enhanced product. In addition, WTB analyses will also overstate the effects of feature enhancement on market share or sales as these analyses also do not take into account the fact that a new equilibrium will prevail in the market after feature enhancement takes place.

We illustrate our method by an application in the point and shoot digital camera market. We consider the addition of a swivel screen display to a point and shoot digital camera product. We designed and fielded a conjoint survey with all of the major brands and other major product features. Our equilibrium computations show that the economic value of the swivel screen is substantial and discernible from zero, but about one half of the pseudo-WTP measure commonly employed.

Greg Allenby
Peter Rossi

REFERENCES

Allenby, G. M., J. D. Brazell, J. R. Howell and P. E. Rossi (2014): "Economic Valuation of Product Features," working paper, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2359003

Berry, S., J. Levinsohn, and A. Pakes (1995): "Automobile Prices in Market Equilibrium," Econometrica, 63(4), 841–890.

Brazell, J., C. Diener, E. Karniouchina, W. Moore, V. Severin, and P.-F. Uldry (2006): "The No-Choice Option and Dual Response Choice Designs," Marketing Letters, 17(4), 255–268.

Chintagunta, P. K., and H. Nair (2011): "Discrete-Choice Models of Consumer Demand in Marketing," Marketing Science, 30(6), 977–996.

Ding, M., R. Grewal, and J. Liechty (2005): "Incentive-Aligned Conjoint," Journal of Marketing Research, 42(1), 67–82.

Dong, S., M. Ding, and J. Huber (2010): "A Simple Mechanism to Incentive-Align Conjoint Experiments," International Journal of Research in Marketing, 27, 25–32.

McFadden, D. L. (1981): "Econometric Models of Probabilistic Choice," in Structural Analysis of Discrete Choice, ed. by M. Intriligator and Z. Griliches, pp. 1395–1457. North-Holland.

Ofek, E., and V. Srinivasan (2002): "How Much Does the Market Value an Improvement in a Product Attribute," Marketing Science, 21(4), 398–411.

Orme, B. K. (2001): "Assessing the Monetary Value of Attribute Levels with Conjoint Analysis," Discussion paper, Sawtooth Software, Inc.

Petrin, A. (2002): "Quantifying the Benefits of New Products: The Case of the Minivan," Journal of Political Economy, 110(4), 705–729.

Rossi, P. E., G. M. Allenby, and R. E. McCulloch (2005): Bayesian Statistics and Marketing. John Wiley & Sons.

Sonnier, G., A. Ainslie, and T. Otter (2007): "Heterogeneity Distributions of Willingness-to-Pay in Choice Models," Quantitative Marketing and Economics, 5, 313–331.

Trajtenberg, M. (1989): "The Welfare Analysis of Product Innovations, with an Application to Computed Tomography Scanners," Journal of Political Economy, 97(2), 444–479.

Uldry, P., V. Severin, and C. Diener (2002): "Using a Dual Response Framework in Choice Modeling," in AMA Advanced Research Techniques Forum.

THE BALLAD OF BEST AND WORST
TATIANA DYACHENKO, REBECCA WALKER NAYLOR, GREG ALLENBY
OHIO STATE UNIVERSITY

"Best is best and worst is worst, and never the twain shall meet
Till both are brought to one accord in a model that's hard to beat."
(with apologies to Rudyard Kipling)

In this paper, we investigate the psychological processes underlying the Best-Worst choice procedure. We find evidence for sequential evaluation in Best-Worst tasks that is accompanied by sequence scaling and question framing scaling effects. We propose a model that accounts for these effects and show the superiority of our model over currently used models of single evaluation.
INTRODUCTION

Researchers in marketing are constantly developing tools and methodologies to improve the quality of inferences about consumer preferences. Examples include the use of hierarchical models of marketplace data and the development of novel ways of collecting data in surveys. An important aspect of this research involves validating and testing the models that have been proposed.

In this paper, we take a look at one of these relatively new methods, called "Maximum Difference Scaling," also known as Best-Worst choice tasks. This method was proposed (Finn & Louviere, 1992) to address the concern that insufficient information is collected for each individual respondent in discrete choice experiments. In Best-Worst choice tasks, each respondent is asked to make two selections: select the best, or most preferred, alternative and the worst, or least preferred, alternative from a list of items. Thus, the tool allows the researcher to collect twice as many responses from the same number of choice tasks for the same respondent. This tool has been extensively studied and compared to other tools available to marketing researchers (Bacon et al., 2007; Wirth, 2010; Marley et al., 2005; Marley et al., 2008). MaxDiff became popular in practice due to its superior performance (Wirth, 2010) compared to traditional choice-based tasks in which only one response, the best alternative, is collected.

While we applaud the development of a tool that addresses the need for better inferences, we believe that marketing researchers need to think carefully about the analysis of the data coming from the tool. It is important to understand the assumptions that are built into the models that estimate parameters related to consumer preferences. The main assumption that underlies current analysis of MaxDiff data is the assumption of equivalency of the "select-the-best" and "select-the-worst" responses, meaning that the two response subsets contain the same quality of information for making inferences about consumer preferences.

We can test this assumption of equivalency of information by performing the following analysis. We can split the Best-Worst data into two sub-datasets: "best" only responses and "worst" only responses. If respondents do rank items from top to bottom in a single evaluation, then we can run the same model on each data subset to obtain inferences about the preference parameters. If this assumption of one-time ranking is true, the parameters recovered from the two data subsets should be almost the same or at least close to each other.

Figure 1 shows the findings from performing this analysis on an actual dataset generated by the Best-Worst task. We plotted the means of estimated preference parameters from the "select-the-best" only responses on the horizontal axis and from the "select-the-worst" only responses on the vertical axis. If the two subsets from the Best-Worst data contained the same quality of information about the preference parameters β, then all points would lie on or close to the 45-degree line. We see that there are two types of systematic non-equivalence of the two datasets. First, the datasets differ in the size of the range that the parameters span for Best and Worst responses. This result is interesting, as it would indicate more consistency in Best responses than in Worst. Second, there seems to be a possible relationship between the Best and Worst parameters. These two factors indicate that the current model's assumption of single, or one-time, evaluation in Best-Worst tasks should be re-evaluated, as the actual data do not support this assumption.

Figure 1. Means of preference parameters estimated from the "select-the-best" subset (horizontal axis) and from the "select-the-worst" subset (vertical axis)
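A hedged sketch of this split-sample diagnostic is given below. It simulates Best-Worst tasks, fits a simple conditional logit separately to the "best" responses and to the "worst" responses (with the utility sign flipped and the chosen best excluded for the latter), and compares the two sets of estimates. The simulated data and the pooled, non-hierarchical estimator are stand-ins for illustration only; the analysis behind Figure 1 uses the actual survey data and hierarchical estimates.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n_items, n_tasks, set_size = 15, 1500, 5
    beta_true = rng.normal(0, 1, n_items)

    # Simulate Best-Worst tasks: 'best' is chosen from the full set,
    # 'worst' is chosen from the remaining items after the best is removed.
    data = []
    for _ in range(n_tasks):
        items = rng.choice(n_items, size=set_size, replace=False)
        u_best = beta_true[items] + rng.gumbel(size=set_size)
        b = int(np.argmax(u_best))
        rest = np.delete(np.arange(set_size), b)
        u_worst = -beta_true[items[rest]] + rng.gumbel(size=set_size - 1)
        w = int(rest[np.argmax(u_worst)])
        data.append((items, b, w))

    def neg_loglik_best(beta):
        ll = 0.0
        for items, b, _ in data:
            v = beta[items]
            ll += v[b] - np.log(np.exp(v).sum())
        return -ll

    def neg_loglik_worst(beta):
        ll = 0.0
        for items, b, w in data:
            keep = [k for k in range(set_size) if k != b]   # best is excluded from the worst task
            v = -beta[items[keep]]
            ll += -beta[items[w]] - np.log(np.exp(v).sum())
        return -ll

    beta0 = np.zeros(n_items)
    b_best = minimize(neg_loglik_best, beta0, method="BFGS").x
    b_worst = minimize(neg_loglik_worst, beta0, method="BFGS").x

    # Logit parameters are identified only up to an additive constant, so center before comparing.
    b_best -= b_best.mean()
    b_worst -= b_worst.mean()
    print("correlation of best- vs worst-based estimates:", round(float(np.corrcoef(b_best, b_worst)[0, 1]), 3))
    print("range (best):", round(float(np.ptp(b_best)), 2), " range (worst):", round(float(np.ptp(b_worst)), 2))

Because these data are simulated with a single set of preferences and equal scales, the two estimates will roughly agree here; the sketch only demonstrates the mechanics of the check, not the empirical divergence reported in Figure 1.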
PROPOSED APPROACH

To understand the results presented in Figure 1, we take a deeper look at how the decisions in the Best-Worst choice tasks are made; that is, we consider possible data-generating mechanisms in these tasks. To think about these processes, we turned to the psychology literature, which presents a vast amount of evidence suggesting that we should not expect the information collected from the "select-the-best" and "select-the-worst" decisions to be equivalent. This literature also provides a source of multiple theories that can help drive how we think about and analyze the data from Best-Worst choice tasks.

In this paper, we present an approach that takes advantage of theories in the psychology literature. We allow these psychological theories to drive the development of the model specification for the Best-Worst choice tasks. We incorporate several elements into the mathematical expression of the model, the presence of which is driven by specific theories.

The first component is related to sequential answering of the questions in the Best-Worst choice tasks. Sequential evaluation is one of the simplifying mechanisms that we believe is used by respondents in these tasks. This mechanism allows for two possible decision sequences: selecting the best first and then moving to selecting the worst alternative, or answering the "worst" question first and then choosing the "best" alternative. This is in contrast to the assumptions of the two current models developed for these tasks: single ranking, as described above, and a pairwise comparison of all items presented in the choice task, where people are assumed to maximize the distance between the two items.

The sequential decision making that we assume in our model generates two different conditions under which respondents provide their answers, because there are different numbers of items that people evaluate in the first and the second decision. In the first response, there is a full list of items from which to make a choice, while the second choice involves a subset of items with one fewer alternative, because the first selected item is excluded from the subsequent decision making. This possibly changes the difficulty of the task as respondents move from the first to the second choice, making the second decision easier with respect to the number of items that need to be processed. This change in the second decision is represented in our model through the parameter ψ (order effect), which is expected to be greater than one to reflect that the second decision is less error prone because it is easier.

Another component of our model deals with the nature of the "select-the-best" and "select-the-worst" questions. We believe that there is another effect that can be accounted for in our sequential evaluation model for these choice tasks that cannot be included in any of the single-evaluation models. As discussed above, "sequential evaluation" means that there are two questions in Best-Worst tasks, "select-the-best" and "select-the-worst," that are answered sequentially.
But these two questions require switching between the mindsets that drive the responses to the two questions. To select the best alternative, a person retrieves experiences and memories that are congruent with the question at hand: "select-the-best." The other question, "select-the-worst," is framed such that another, possibly overlapping, set of memories and associations is retrieved that is more congruent with that question. The process of such biased memory retrieval is described in the psychology literature on hypothesis testing and confirmation bias (Snyder, 1981; Hoch and Ha, 1986). This literature suggests that people are more likely to attend to information that is consistent with the hypothesis at hand. In the Best-Worst choice tasks, the temporary hypothesis in the "Best" question is "find the best alternative," so that people are likely to attend mostly to memories related to the best or most important things that happened to them. The "Worst" question would generate another hypothesis, "select the worst," that people would be trying to confirm. This would create a different mental frame, making people think about other, possibly bad or less important, experiences to answer that question.

The subsets of memories from the two questions might be different or overlap partially. We believe that there is an overlap and, thus, the differences in preference parameters between the two questions can be represented by a change in scale. This is the scale parameter λ (question framing effect) in our model. However, if the retrievals in the two questions are independent and generate different samples of memories, then a model where we allow for independent preference parameters β would perform better than the model that only adjusts the scale parameter.

The third component of the model is related to the error term distribution in models for Best-Worst choice tasks. Traditionally, a logit specification based on the maximum extreme value assumption on the error term is used. This is mostly due to the mathematical and computational convenience of these models: the probability expressions have closed forms and, hence, the model can be estimated relatively fast. We, however, want to give the error term distributional assumption serious consideration by thinking about appropriate reasons for the use of extreme value (asymmetric) versus normal (symmetric) distributional assumptions. The question is: can we use the psychology literature to help us justify the use of one specification versus another?

As an example of how this can be done, we use the theory of episodic versus semantic memory retrieval and processing (Tulving, 1972). When answering Best-Worst questions, people need to summarize the subsets of information that were just retrieved from memory. If the memories and associations are aggregated by averaging (or summing) over the episodes and experiences (which would be consistent with a semantic information processing and retrieval mechanism), then that would be consistent with the use of a normally distributed (symmetric) error term due to the Central Limit Theorem. However, if respondents pay attention to specific episodes within these samples of information, looking for the most representative episodes to answer the question at hand (which would be consistent with an episodic memory processing mechanism), then the extreme value error term assumption would be justified. This is due to extreme value theory, which says that the maximum, or minimum, of a sample of random variables is approximately distributed as a max, or min, extreme value random variable. Thus, in the "select-the-best" decision it is appropriate to use the maximum extreme value error term, and in the "select-the-worst" question the minimum extreme value distribution is justified.

Equation 1 is the model for one Best-Worst decision task. This equation shows the model based on episodic memory processing, or extreme value error terms. It includes the two possible sequences, indexed by the parameter θ, the order scale parameter ψ in the second decision, the exclusion of the first choice from the set in the second decision, and our question framing scaling parameter λ. The model with the normal error term assumption has the same conceptual structure, but the choice probabilities have different expressions.

Equation 1. Sequential Evaluation Model (logit specification)

This model is a generalized model and includes some existing models as special cases. For example, if we use a probability weight instead of the sequence indicator θ, then under specific values of that parameter our model includes the traditional MaxDiff model. The concordant model by Marley et al. (2005) would also be a special case of our modified model.

EMPIRICAL APPLICATION AND RESULTS

We applied our model to data that were collected from an SSI panel. Respondents went through 15 choice tasks with five items each, as shown in Figure 2. The items came from a list of 15 hair care concerns and issues. We analyzed responses from 594 female respondents over 50 years old. This segment of the population is known for high involvement with the hair care category. For example, in our sample, 65% of respondents expressed some level of involvement with the category.

Figure 2. Best-Worst task

We estimated our proposed models with and without the proposed effects. We used hierarchical Bayesian estimation where the preference parameters β, the order effect ψ and the context effect λ are heterogeneous. To ensure empirical identification, the latent sequence parameter θ is estimated as an indicator parameter from a Bernoulli distribution and is assumed to be the same for all respondents. We use standard priors for the parameters of interest. Table 1 shows the improvement in model fit (log marginal density, Newton-Raftery estimator) as the result of the presence of each effect, that is, the marginal effect of each model element. Table 2 shows in-sample and holdout hit probabilities for the Best-Worst pair (random chance is 0.05).

Table 1. Model Fit
                                          LMD (NR)
  Exploded logit (single evaluation)      -13,040
  Context effect only                     -12,455
  Order effect only                       -11,755
  Context and order effects together      -11,051
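As a hedged sketch of what such a likelihood can look like (the exact parameterization of the authors' Equation 1 may differ), the best-first branch of a sequential logit model for a choice set S with part-worths β_k, order-effect scale ψ, and question-framing scale λ can be written as

\[
\Pr(\text{best}=b,\ \text{worst}=w \mid \text{best first}) \;=\;
\frac{\exp(\beta_b)}{\sum_{k \in S} \exp(\beta_k)} \cdot
\frac{\exp(-\psi \lambda \beta_w)}{\sum_{k \in S \setminus \{b\}} \exp(-\psi \lambda \beta_k)},
\]

with the worst-first branch defined analogously (the worst alternative chosen first from the full set at scale λ, followed by the best alternative chosen from S \ {w} at scale ψ), and the indicator θ selecting which of the two sequences generated the observed pair.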
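For reference, the LMD (NR) column is, we assume, the Newton-Raftery harmonic-mean estimator of the log marginal density, computed from the posterior draws ω^(r), r = 1, ..., R, of all model parameters:

\[
\widehat{\mathrm{LMD}} \;=\; -\log\!\left( \frac{1}{R} \sum_{r=1}^{R} \frac{1}{p(y \mid \omega^{(r)})} \right),
\]

where p(y | ω^(r)) is the likelihood of the observed choices at draw r; larger (less negative) values indicate better fit.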
These tables show significant improvement in fit from each of the components of the model. The strongest improvement comes from the order effect, indicating that the sequential mechanisms we assumed are more plausible given the data than the model with the assumption of single evaluation. The context effect improves the fit as well, indicating that it is likely that the two questions, "select-the-best" and "select-the-worst," are processed differently by respondents. The model with both effects included is the best model not just with respect to fit to the data in-sample, but also in terms of holdout performance.

Table 2. Improvement in Model Fit
                                         In-sample            In-sample       Holdout              Holdout
                                         Hit Probabilities    Improvement*    Hit Probabilities    Improvement*
  Exploded logit (single evaluation)     0.3062               -               0.2173               -
  Context effect only                    0.3168               3.5%            0.2226               2.4%
  Order effect only                      0.3443               12.4%           0.2356               8.4%
  Context and order effects together     0.3789               23.7%           0.2499               15.0%

  * Improvements are calculated over the metric in the first line, which comes from the model that assumes single evaluation (ranking) in Best-Worst tasks.

We found that both error term assumptions (symmetric and asymmetric) are plausible, as the fits of the models are very similar. Based on that finding, we can recommend using our sequential logit model, as it has computational advantages over the sequential probit model. The remaining results we present are based on the sequential logit model. We also found that the presence of dependent preference parameters between the "Best" and "Worst" questions (question framing scale effect λ) is a better-fitting assumption than the assumption of independence of the β's across the two questions.

From a managerial standpoint, we want to show why it is important to use our sequential evaluation model instead of single evaluation models. We compared individual preference parameters from two models: our best performing model and the exploded logit specification (single-evaluation ranking model). Table 3 shows the proportion of respondents for whom the subset of top items is the same between these two models. For example, for the top 3 items related to hair care concerns and issues, the two models agree for only 61% of respondents. If we take into account the order within these subsets, then the matching proportion drops to 46%. This means that for more than half of the respondents in our study, the findings and recommendations will differ between the two models. Given that our model of sequential evaluation is a better-fitting model, we suggest that the results from single evaluation models can be misleading for managerial implications and that the results from our sequential evaluation model should be used.

Table 3. Proportion of respondents matched on top n items of importance between sequential and single evaluation (exploded logit) models
  Top n items                                           1        2        3        4        5        6
  Proportion of respondents (order does not matter)     83.7%    72.1%    61.1%    53.2%    47.0%    37.7%
  Proportion of respondents (order does matter)         83.7%    65.0%    46.5%    29.1%    18.4%    10.4%
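As a brief sketch of how agreement rates like those in Table 3 can be computed from two matrices of individual-level estimates (respondents in rows, items in columns), consider the following; the inputs here are random placeholders rather than the actual posterior means.

    import numpy as np

    def top_n_agreement(est_a, est_b, n):
        """Share of respondents whose top-n items agree across two models (as sets and as ordered lists)."""
        top_a = np.argsort(-est_a, axis=1)[:, :n]      # highest-utility items first
        top_b = np.argsort(-est_b, axis=1)[:, :n]
        same_set = np.array([set(a) == set(b) for a, b in zip(top_a, top_b)])
        same_order = np.all(top_a == top_b, axis=1)
        return same_set.mean(), same_order.mean()

    # Placeholder individual-level estimates: 594 respondents, 15 items.
    rng = np.random.default_rng(2)
    beta_sequential = rng.normal(size=(594, 15))
    beta_exploded = beta_sequential + rng.normal(scale=0.5, size=(594, 15))

    for n in range(1, 7):
        set_match, order_match = top_n_agreement(beta_sequential, beta_exploded, n)
        print(f"top {n}: set match {set_match:.1%}, ordered match {order_match:.1%}")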
Our sequential evaluation model also provides additional insights about the processes that are present in Best-Worst choice tasks. First, we found that in these tasks respondents are more likely to eliminate the worst alternative from the list first and then select the best one. This is consistent with literature that suggests that people, when presented with multiple alternatives, are more likely to simplify the task by eliminating, or screening out, some options (Ordóñez et al., 1999; Beach and Potter, 1992). Given the nature of the task in our application, where respondents had to select the most important and least important items, it is not surprising that eliminating what is not important first would be the most likely strategy.

This finding, however, is in contrast with the click data that were collected in these tasks. We found that about 68% of clicks were best-then-worst. To understand this discrepancy, we added the observed sequence information into our model by substituting the indicator of the latent sequence θ with the decision order that we observed. Table 4 shows the results of the fit of these models. The data on the observed sequence make the fit of the model worse. This suggests that researchers need to be careful when assuming that click data are a good representation of the latent processes driving consumer decisions in Best-Worst choice tasks.

Table 4. Fit of the models with latent and observed sequence of decisions
                         LMD (NR)    In-sample Hit Probabilities    Holdout Hit Probabilities
  Latent sequence        -11,051     0.3789                         0.2499
  Observed sequence      -12,392     0.3210                         0.2247

To investigate order further, we manipulated the order of the decisions by collecting responses from two groups. One group was forced to select the best alternative first and then select the worst, and the second group was forced to select in the opposite order. We found that the fit of our model with the indicator of latent sequence is the same as for the group that was required to select the worst alternative first. This analysis gives us more confidence in our finding. Understanding why the click data seem to be inconsistent with the underlying decision-making processes in respondents' minds is outside the scope of this paper, but it is an important topic for future research.

Our model also gives us an opportunity to learn about other effects present in Best-Worst choice tasks and to account for those effects. For instance, there is a difference in the certainty level between the first and the second decisions. As expected, the second decision is less error prone than the first. The mean of the posterior distribution of the order effect ψ is greater than one for almost all respondents. This finding is consistent with our expectation that the decrease in the difficulty of the task in the second choice will impact the certainty level. While we have not directly tested the impact of the number of items on the list on the certainty level, our finding is as expected.

Another effect that we have included in our model is the scale effect of question framing, λ, which represents the level of certainty in the parameters that a researcher obtains from the best and worst selections as a result of the response elicitation procedure ("best" versus "worst"). We found that the sample average of this parameter is 1.17, which is greater than one. This means that, on average, respondents in our sample are more consistent in their "worst" choices. However, we found significant heterogeneity in this parameter among respondents. To understand what can explain the heterogeneity in this parameter, we performed a post-estimation analysis of the context scale parameter as it relates to an individual's expertise level, which was also collected in the survey. We found a negative correlation of -0.16 between the means of the context effect parameter and the level of expertise, meaning that experts are likely to be more consistent about what is important to them and non-experts are more consistent about what is not important to them. We also found a significant negative correlation (-0.20) between the direct measure of the difficulty of the "select-the-worst" items and the context effect parameter, indicating that if it was easier to respond to the "select-the-worst" questions, λ was larger, which is consistent with our proposition and expectations.

CONCLUSIONS

In this paper, we proposed a model to analyze data from Best-Worst choice tasks. We showed how the development of the model specification could be driven by theories from the psychology literature.
We took a deep look at how we can think about the possible processes that underlie decisions in these tasks and how to reflect those processes in the mathematical representation of the data-generating mechanism. We found that our proposed model of sequential evaluation is a better-fitting model than the currently used models of single evaluation. We showed that adding the sequential structure to the model specification allows other effects to be taken into consideration. We found that the second decision is more certain than the first decision, and that the "worst" decision is, on average, the more certain one. Finally, we demonstrated the managerial implications of the proposed model. Our model, which takes into account psychological processes within Best-Worst choice tasks, gives different results about what is most important to specific respondents. This finding has direct implications for new product development initiatives and for understanding the underlying needs and concerns of customers.

Greg Allenby

REFERENCES

Bacon, L., Lenk, P., Seryakova, K., & Veccia, E. (2007) "Making MaxDiff More Informative: Statistical Data Fusion by Way of Latent Variable Modeling," Sawtooth Software Conference Proceedings, 327–343.

Beach, L. R., & Potter, R. E. (1992) "The pre-choice screening of options," Acta Psychologica, 81(2), 115–126.

Finn, A., & Louviere, J. (1992) "Determining the Appropriate Response to Evidence of Public Concern: The Case of Food Safety," Journal of Public Policy & Marketing, 11(2), 12–25.

Hoch, S. J., & Ha, Y.-W. (1986) "Consumer Learning: Advertising and the Ambiguity of Product Experience," Journal of Consumer Research, 13, 221–233.

Marley, A. A. J., & Louviere, J. J. (2005) "Some Probabilistic Models of Best, Worst, and Best-Worst Choices," Journal of Mathematical Psychology, 49, 464–480.

Marley, A. A. J., Flynn, T. N., & Louviere, J. J. (2008) "Probabilistic Models of Set-Dependent and Attribute-Level Best-Worst Choice," Journal of Mathematical Psychology, 52, 281–296.

Ordóñez, L. D., Benson, L., III, & Beach, L. R. (1999) "Testing the Compatibility Test: How Instructions, Accountability, and Anticipated Regret Affect Prechoice Screening of Options," Organizational Behavior and Human Decision Processes, 78, 63–80.

Snyder, M. (1981) "Seek and ye shall find: Testing hypotheses about other people," in C. Heiman, E. Higgins and M. Zanna, eds., Social Cognition: The Ontario Symposium on Personality and Social Psychology, Hillsdale, NJ: Erlbaum, 277–303.

Tulving, E. (1972) "Episodic and Semantic Memory," in E. Tulving and W. Donaldson, eds., Organization of Memory, Academic Press, New York and London, pp. 381–402.

Wirth, R. (2010) "HB-CBC, HB-Best-Worst CBC or No HB at All," Sawtooth Software Conference Proceedings, 321–356.