PROCEEDINGS OF THE
SAWTOOTH SOFTWARE
CONFERENCE
October 2013
Copyright 2014
All rights reserved. No part of this volume may be reproduced
or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage and
retrieval system, without permission in writing from
Sawtooth Software, Inc.
FOREWORD
These proceedings are a written report of the seventeenth Sawtooth Software Conference,
held in Dana Point, California, October 16–18, 2013. Two-hundred ten attendees participated.
This conference included a separate Healthcare Applications in Conjoint Analysis track; however,
these proceedings contain only the papers delivered at the Sawtooth Software Conference.
The focus of the Sawtooth Software Conference continues to be quantitative methods in
marketing research. The authors were charged with delivering presentations of value to both the
most sophisticated and least sophisticated attendees. Topics included choice/conjoint analysis,
surveying on mobile platforms, Menu-Based Choice, MaxDiff, hierarchical Bayesian estimation,
latent class procedures, optimization routines, cluster ensemble analysis, and random forests.
The papers and discussant comments are in the words of the authors and very little copy
editing was performed. At the end of each of the papers, we’re pleased to display photographs of
the authors and co-authors who attended the conference. We appreciate their willingness to sit
for these portraits! It lends a personal touch and makes it easier for readers to recognize them at
the next conference. We are grateful to these authors for continuing to make this conference a
valuable event and advancing our collective knowledge in this exciting field.
Sawtooth Software
June, 2014
CONTENTS
9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS .................................................1
Chris Chapman, Google
QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A TRADITIONAL MANUFACTURING FIRM:
UPDATE AND CASE STUDY .................................................................................................... 13
Robert J. Goodwin, Lifetime Products, Inc.
CAN CONJOINT BE FUN?:
IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS ........................................... 39
Jane Tang & Andrew Grenville, Vision Critical
MAKING CONJOINT MOBILE: ADAPTING CONJOINT TO THE MOBILE PHENOMENON ........... 55
Chris Diener, Rajat Narang, Mohit Shant, Hem Chander & Mukul Goyal, AbsolutData
CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS ........................................................ 69
Joseph White, Maritz Research
USING COMPLEX CHOICE MODELS TO DRIVE BUSINESS DECISIONS ..................................... 83
Karen Fuller, HomeAway, Inc. & Karen Buros, Radius Global Market Research
AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY ....................................... 97
Brent Fuller, Matt Madden & Michael Smith, The Modellers
MAXDIFF AUGMENTATION: EFFORT VS. IMPACT ............................................................... 105
Urszula Jones, TNS & Jing Yeh, Millward Brown
WHEN U = βX IS NOT ENOUGH: MODELING DIMINISHING RETURNS AMONG CORRELATED
CONJOINT ATTRIBUTES ...................................................................................................... 115
Kevin Lattery, Maritz Research
RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE?
A VARIANCE DECOMPOSITION OF HB UTILITIES ................................................................ 129
Keith Chrzan & Aaron Hill, Sawtooth Software
FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO CREATE VALUE ............... 135
Karlan Witt & Deb Ploskonka, Cambia Information Group
BRAND IMAGERY MEASUREMENT: ASSESSMENT OF CURRENT PRACTICE AND A
NEW APPROACH .............................................................................................................. 147
Paul Richard McCullough, MACRO Consulting, Inc.
ACBC REVISITED ............................................................................................................. 165
Marco Hoogerbrugge, Jeroen Hardon & Christopher Fotenos, SKIM Group
RESEARCH SPACE AND REALISTIC PRICING IN SHELF LAYOUT CONJOINT (SLC) ................. 181
Peter Kurz, TNS Infratest, Stefan Binner, bms marketing research + strategy
& Leonhard Kehl, Premium Choice Research & Consulting
ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS ..................................... 195
Dan Yardley, Maritz Research
ANCHORED ADAPTIVE MAXDIFF: APPLICATION IN CONTINUOUS CONCEPT TEST ............... 205
Rosanna Mau, Jane Tang, LeAnn Helmrich & Maggie Cournoyer, Vision Critical
HOW IMPORTANT ARE THE OBVIOUS COMPARISONS IN CBC?
THE IMPACT OF REMOVING EASY CONJOINT TASKS ........................................................... 221
Paul Johnson & Weston Hadlock, SSI
SEGMENTING CHOICE AND NON-CHOICE DATA SIMULTANEOUSLY ................................... 231
Thomas C. Eagle, Eagle Analytics of California
EXTENDING CLUSTER ENSEMBLE ANALYSIS VIA SEMI-SUPERVISED LEARNING ...................... 251
Ewa Nowakowska, GfK Custom Research North America & Joseph Retzer, CMI
Research, Inc.
THE SHAPLEY VALUE IN MARKETING RESEARCH: 15 YEARS AND COUNTING ....................... 267
Michael Conklin & Stan Lipovetsky, GfK
DEMONSTRATING THE NEED AND VALUE FOR A MULTI-OBJECTIVE PRODUCT SEARCH......... 275
Scott Ferguson & Garrett Foster, North Carolina State University
A SIMULATION BASED EVALUATION OF THE PROPERTIES OF ANCHORED MAXDIFF:
STRENGTHS, LIMITATIONS AND RECOMMENDATIONS FOR PRACTICE.................................... 305
Jake Lee, Maritz Research & Jeffrey P. Dotson, Brigham Young University
BEST-WORST CBC CONJOINT APPLIED TO SCHOOL CHOICE:
SEPARATING ASPIRATION FROM AVERSION ........................................................................ 317
Angelyn Fairchild, RTI International, Namika Sagara & Joel Huber, Duke
University
DOES THE ANALYSIS OF MAXDIFF DATA REQUIRE SEPARATE SCALING FACTORS? .............. 331
Jack Horne & Bob Rayner, Market Strategies International
USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF PRODUCT FEATURES .... 341
Greg Allenby, Ohio State University, Jeff Brazell, The Modellers,
John Howell, Penn State University & Peter Rossi, University of California Los Angeles
THE BALLAD OF BEST AND WORST ...................................................................................... 357
Tatiana Dyachenko, Rebecca Walker Naylor & Greg Allenby, Ohio State University
SUMMARY OF FINDINGS
The seventeenth Sawtooth Software Conference was held in Dana Point, California, October
16–18, 2013. The summaries below capture some of the main points of the presentations and
provide a quick overview of the articles available within the 2013 Sawtooth Software
Conference Proceedings.
9 Things Clients Get Wrong about Conjoint Analysis (Chris Chapman, Google): Conjoint
analysis has been used with great success in industry, Chris explained, but this often leads to
some clients having misguided expectations regarding the technique. As a prime example, many
clients are hoping that conjoint analysis will predict the volume of demand. While conjoint can
provide important input to a forecasting model, it usually cannot alone predict volume without
other inputs such as awareness, promotion, channel effects and competitive response. Chris
cautioned against examining only average part-worth utility scores without considering the
distribution of preferences (heterogeneity), which often reveals profitable niche strategies. He also
recommended fielding multiple studies with modest sample sizes that examine a business
problem using different approaches rather than fielding one high-budget, large sample size
survey. Finally, Chris stressed that leveraging insights from analytics (such as from conjoint) is
better than relying solely on managerial instincts. It will generally increase the upside potential
and reduce the downside risk for business decisions.
Quantitative Marketing Research Solutions in a Traditional Manufacturing Firm:
Update and Case Study (Robert J. Goodwin, Lifetime Products, Inc.): Bob’s presentation
highlighted the history of Lifetime’s use of conjoint methods to help it design and market its
consumer-oriented product line. He also presented findings regarding a specific test involving
Adaptive CBC (ACBC). Regarding practical lessons learned while executing numerous conjoint
studies at Lifetime, Bob cited not overloading the number of attributes in the list just because the
software can support them. Some attributes might be broken out and investigated using non-conjoint questions. Also, because many respondents do not care much about brand name in the
retailing environment in which Lifetime engages, Bob has dropped brand from some of his conjoint
studies. But, when he wants to measure brand equity, Bob uses a simulation method to estimate
the value of a brand, in the context of competitive offerings and the “None” alternative. Finally,
Bob conducted a split-sample test involving ACBC. He found that altering the questionnaire
settings to focus the “near-neighbor design” either more tightly or less tightly around the
respondent’s BYO-specified concept didn’t change results much. This, he argued, demonstrates
the robustness of ACBC results to different questionnaire settings.
Can Conjoint Be Fun?: Improving Respondent Engagement in CBC Experiments (Jane
Tang and Andrew Grenville, Vision Critical): Traditional CBC can be boring for respondents.
This was mentioned in a recent Greenbook blog. Jane gave reasons why we should try to engage
respondents in more interesting surveys, such as a) cleaner data often result, and b) happy
respondents are happy panelists (and panelist cooperation is key). Ways to make surveys more
fun include using adaptive tasks that seem to listen and respond to respondent preferences as
well as feedback mechanisms that report something back to respondents based on their
preferences. Jane and her co-author Andrew did a split-sample test to see if adding adaptive tasks
and a feedback mechanism could improve CBC results. They employed tournament tasks,
wherein concepts that win in earlier tasks are displayed again in later tasks. They also employed
a simple level-counting mechanism to report back the respondent’s preferred product concept.
Although their study design didn’t include good holdouts to examine predictive validity, there
was at least modest evidence that the adaptive CBC design had lower error and performed better.
Also, some qualitative evidence suggested that respondents preferred the adaptive survey. After
accounting for scale differences (noise), they found very few differences in the utility parameters
for respondents receiving “fun” versus standard CBC surveys. In sum, Jane suggested that if the
utility results are essentially equivalent, why not let the respondent have more fun?
Making Conjoint Mobile: Adapting Conjoint to the Mobile Phenomenon (Chris Diener,
Rajat Narang, Mohit Shant, Hem Chander, and Mukul Goyal, AbsolutData): Chris and his co-authors examined issues involving the use of mobile devices to complete complex conjoint
studies. Each year more respondents are choosing to complete surveys using mobile devices, so
this topic should interest conjoint analysis researchers. It has been argued that the small screen
size for mobile devices may make it nearly impossible to conduct complex conjoint studies
involving relatively large lists of attributes. The authors conducted a split-sample experiment
involving the US and India using five different kinds of conjoint analysis variants (all sharing the
same attribute list). The variants included standard CBC, partial-profile, and adaptive methods
(ACBC). Chris found only small differences in the utilities or the predictive validity for PC-completed surveys versus mobile-completed surveys. Surprisingly, mobile respondents generally
reported no readability issues (ability to read the questions and concepts on the screen) compared
to PC respondents. The authors concluded that conjoint studies, even those involving nine
attributes (as in their example), can be done effectively among those who elect to complete the
surveys using their mobile devices (providing researchers keep the surveys short, use proper
conjoint questionnaire settings, and emphasize good aesthetics).
Choice Experiments in Mobile Web Environments (Joseph White, Maritz Research):
Joseph looked at the feasibility of conducting two separate eight-attribute conjoint analysis
studies using PC, tablet, or mobile devices. He compared results based on the Swait-Louviere
test, allowing him to examine results in terms of scale and parameter equivalence. He also
examined internal and external fit criteria. His conjoint questionnaires included full-profile and
partial-profile CBC. Across both conjoint studies, he concluded that respondents who choose to
complete the studies via a mobile device show predictive validity at parity with, or better than,
those who choose to complete the same study via either PC or tablet. Furthermore, the response
error for mobile-completed surveys is on par with PC. He summed it up by stating, “Consistency
of results in both studies indicate that even more complicated discrete choice experiments can be
readily completed in mobile computing environments.”
Using Complex Choice Models to Drive Business Decisions (Karen Fuller, HomeAway,
Inc. and Karen Buros, Radius Global Market Research): Karen Fuller and Karen Buros jointly
presented a case study involving a complex menu-based choice (MBC) experiment for Fuller’s
company, HomeAway. HomeAway offers an online marketplace for vacation travelers to find
rental properties. Vacation home owners and property managers list rental property on
HomeAway’s website. The challenge for HomeAway was to design the pricing structure and
listing options to better support the needs of owners and to create a better experience for
travelers. Ideally, this would also increase revenues per listing. They developed an online
questionnaire that looked exactly like HomeAway’s website, including three screens to fully
select all options involving creating a listing. This process nearly replicated HomeAway’s
existing enrollment process (so much so that some respondents got confused regarding whether
they had completed a survey or done the real thing). Nearly 2,500 US-based respondents
completed multiple listings (MBC tasks), where the options and pricing varied from task to task.
Later, a similar study was conducted in Europe. CBC software was used to generate the
experimental design, the questionnaire was custom-built, and the data were analyzed using MBC
software. The results led to specific recommendations for management, including adopting a
tiered pricing structure, offering additional options, and increasing the base annual subscription price.
After implementing many of the suggestions of the model, HomeAway has experienced greater
revenues per listing and its highest renewal rates among customers choosing the tiered
pricing.
Augmenting Discrete Choice Data—A Q-sort Case Study (Brent Fuller, Matt Madden, and
Michael Smith, The Modellers): Sometimes clients want to field CBC studies that have an
attribute with an unusually large number of levels, such as messaging and promotion attributes.
The problem with such attributes is obtaining enough precision to avoid illogical reversals while
avoiding excessive respondent burden. Mike and Brent presented an approach to augmenting
CBC data with Q-sort rankings for these attributes involving many levels. A Q-sort exercise asks
respondents to sort items into a small number of buckets, but where the number of items
assigned per bucket is fixed by the researcher. The information from the Q-sort can be appended
to the data as a series of inequalities (e.g., level 12 is preferred to level 18) constructed as new
choice tasks. Mike and Brent found that the CBC data without Q-sort augmentation had some
illogical preference orderings for the 19-level attribute. With augmentation, the reversals
disappeared. One problem with augmenting the data is that it can artificially inflate the
importance of the augmented attribute relative to the non-augmented attributes. Solutions to this
problem include scaling back the importances to the original importances (at the individual level)
obtained from HB estimation of the CBC data prior to augmentation.
MaxDiff Augmentation: Effort vs. Impact (Urszula Jones, TNS and Jing Yeh, Millward
Brown): Ula (Urszula) and Jing described the common challenge that clients want to use
MaxDiff to test a large number of items. With standard rules of thumb (to obtain stable
individual-level estimates), the number of choice tasks becomes very large per respondent.
Previous solutions presented at the Sawtooth Software Conference include augmenting the data
using Q-Sort (Augmented MaxDiff), Express MaxDiff (each respondent sees only a subset of
the items), or Sparse MaxDiff (each respondent sees each item fewer than three times). Ula and
Jing further investigated whether Augmented MaxDiff was worth the additional survey
programming effort (as it is the most complicated) or whether the other approaches were
sufficient. Although the authors didn’t implement holdout tasks that would have given a better
read on predictive validity of the different approaches, they did draw some conclusions. They
concluded a) at the individual level, Sparse MaxDiff is not very precise, but in aggregate the
results are accurate, b) If you have limited time, augmenting data using rankings of the top items
is probably better than augmenting the bottom items, and c) Augmenting on both top and bottom
items is best if you need accurate individual-level results for TURF or clustering.
When U = βx Is Not Enough: Modeling Diminishing Returns among Correlated
Conjoint Attributes (Kevin Lattery, Maritz Research): When conjoint analysis studies involve a
large number of binary (on/off) features, standard conjoint models tend to over-predict interest in
product concepts loaded up with nearly all the features and to under-predict product concepts
including very few of the features. This occurs because typical conjoint analysis (using main
effects estimation) assumes all the attributes are independent. But, Kevin explained, there often
are diminishing returns when bundling multiple binary attributes (though the problem isn’t
limited to just binary attributes). Kevin reviewed some design principles involving binary
attributes (avoid the situation in which the same number of “on” levels occurs in each product
concept). Next, Kevin discussed different ways to account for the diminishing returns among a
series of binary items. Interaction effects can partially solve the problem (and are a stable and
practical solution if the number of binary items is about 3 or fewer). Another approach is to
introduce a continuous variable for the number of “on” levels within a concept. But, Kevin
proposed a more complete solution that borrows from nested logit. He demonstrated greater
predictive validity to holdouts for the nested logit approach than the approach of including a term
representing the number of “on” levels in the concept. One current drawback, he noted, is that
his solution may be difficult to implement with HB estimation.
Respondent Heterogeneity, Version Effects or Scale? A Variance Decomposition of HB
Utilities (Keith Chrzan and Aaron Hill, Sawtooth Software): When researchers use CBC or
MaxDiff, they hope the utility scores are independent of which version (block) each respondent
received. However, one of the authors, Keith Chrzan, saw a study a few years ago in which
assignment to cluster membership was not independent of questionnaire version (>95%
confidence). This led to further investigation, in which, for more than half of the examined
datasets, the authors found a statistically significant version effect on final estimated utilities
(under methods such as HB). Aaron (who presented the research at the conference) described a
regression model that they built which explained final utilities as a function of a) version effect,
b) scale effect (response error), and c) other. The variance captured in the “other” category is
assumed to be the heterogeneous preferences of respondents. Across multiple datasets, the
average version effect accounted for less than 2% of the variance in final utilities. Scale
accounted for about 11%, with the remaining attributable to substantive differences in
preferences across respondents or other unmeasured sources. Further investigation using
synthetic respondents led the authors to conclude that the version effect was psychological rather
than algorithmic. They concluded that although the version effect is statistically significant, it
isn’t strong enough to really worry about for practical applications.
Fusing Research Data with Social Media Monitoring to Create Value (Karlan Witt and
Deb Ploskonka, Cambia Information Group): Karlan and Deb described the current business
climate, in which social media provides an enormous volume of real-time feedback about brand
health and customer engagement. They recommended that companies should fuse social media
measurement with other research as they design appropriate strategies for marketing mix. The
big question is how best to leverage the social media stream, especially to move beyond the data
gathering and summary stages to actually using the data to create value. Karlan and Deb’s
approach starts with research to identify the key issues of importance to different stakeholders
within the organization. Next, they determine specific thresholds (from social media metrics) that
would signal to each stakeholder (for whom the topic is important) that a significant event was
occurring that required attention. Following the event, the organization can model the effect of
the event on Key Performance Indicators (KPIs) by different customer groups. The authors
presented a case study to illustrate the principles.
Brand Imagery Measurement: Assessment of Current Practice and a New Approach
(Paul Richard McCullough, MACRO Consulting, Inc.): Dick (Richard) reviewed the weaknesses
in current brand imagery measurement practices, specifically the weaknesses of the rating scale
(lack of discrimination, scale use bias, halo). A new approach, brand-anchored MaxDiff,
removes halo, avoids scale use bias, and is more discriminating. The process involves showing
the respondent a brand directly above a MaxDiff question involving, say, 4 or 5 imagery items.
Respondents indicate which of the items most describes the brand and which least describes the
brand. Anchored scaling MaxDiff questions (to estimate a threshold anchor point) allow
comparisons across brands and studies. But, anchored scaling re-introduces some scale use bias.
Dick tried different approaches to reduce the scale use bias associated with anchored MaxDiff
using an empirical study. Part of the empirical study involved different measures of brand
preference. He found that MaxDiff provided better discrimination, better predictive validity (of
brand preference), and greater reduction of brand halo and scale use bias than traditional ratings-based measures of brand imagery. Ratings provided no better predictive validity of brand
preference in his model than random data. However, the new approach also took more
respondent time and had higher abandonment rates.
ACBC Revisited (Marco Hoogerbrugge, Jeroen Hardon, and Christopher Fotenos, SKIM
Group): Christopher and his co-authors reviewed the stages in an Adaptive CBC (ACBC)
interview and provided their insights into why ACBC has been a successful conjoint analysis
approach. They emphasized that ACBC has advantages with more complex attribute lists and
markets. The main thrust of their paper was to test different ACBC interviewing options,
including a dynamic form of CBC programmed by the SKIM Group. They conducted a split-sample study involving choices for televisions. They compared default CBC and ACBC
questionnaires to modifications of ACBC and CBC. Specifically, they investigated whether
dropping the “screener” section in ACBC would hurt results; using a smaller random shock
within summed pricing; whether to include price or not in ACBC’s unacceptable questions; and
the degree to which ACBC samples concepts directly around the BYO-selected concept. For the
SKIM-developed dynamic CBC questionnaire, the first few choice tasks were exactly like a
standard CBC task. The last few tasks displayed winning concepts chosen in the first few tasks.
In terms of prediction of the holdout tasks, all ACBC variants did better than the CBC variants.
None of the ACBC variations seemed to make much difference, suggesting that the ACBC
procedure is quite robust even with simplifications such as removing the screening section.
Research Space and Realistic Pricing in Shelf Layout Conjoint (SLC) (Peter Kurz, TNS
Infratest, Stefan Binner, bms marketing research + strategy, and Leonhard Kehl, Premium Choice
Research & Consulting): In the early 1990s, the first CBC questionnaires only displayed a few
product concepts on the screen, without the use of graphics. Later versions supported shelf-looking displays, complete with graphics and other interactive elements. Rather than using lots
of attributes described in text, the graphics themselves portrayed different sizes, claims, and
package design elements. However, even the most sophisticated computerized CBC surveys
(including virtual-reality) cannot reflect the real situation of a customer at the supermarket. The
authors outlined many challenges involving shelf layout conjoint (SLC). Some of the strengths of
SLC, they suggested, are in optimization of assortment (e.g., line extension problems,
substitution) and price positioning/promotions. Certain research objectives are problematic for
SLC, including volumetric predictions, positioning of products on the shelf, and new product
development. The authors concluded by offering specific recommendations for improving results
when applying SLC, including using realistic pricing patterns and ranges within the tasks, using
realistic tag displays, and reducing the number of parameters to estimate within HB models.
Attribute Non-Attendance in Discrete Choice Experiments (Dan Yardley, Maritz
Research): When respondents ignore certain attributes when answering CBC tasks, this is called
“attribute non-attendance.” Dan described how some researchers in the past have asked
respondents directly which attributes they ignored (stated non-attendance) and have used that
information to try to improve the models. To test different approaches to dealing with non-attendance, Dan conducted two CBC studies. The first involved approximately 1300
respondents, using both full- and partial-profile CBC. The second involved about 2000
respondents. He examined both aggregate and disaggregate (HB) models in terms of model fit
and out-of-sample holdout prediction. Dan also investigated ways to try to ascertain from HB
utilities that respondents were ignoring certain attributes (rather than rely on stated non-attendance). For attributes deemed to have been ignored by a respondent, the codes in the
independent variable matrix were held constant at zero. He found that modeling stated non-attendance had little impact on the results, but usually slightly reduced the fit to holdouts. He
experimented with different cutoff rates under HB modeling to deduce whether individual
respondents had ignored attributes. For his two datasets, he was able to slightly improve
prediction of holdouts using this approach.
Anchored Adaptive MaxDiff: Application in Continuous Concept Test (Rosanna Mau,
Jane Tang, LeAnn Helmrich, and Maggie Cournoyer, Vision Critical): Rosanna and her co-authors investigated the feasibility of using an adaptive form of anchored MaxDiff within multi-wave concept tests as a replacement for traditional 5-point rating scales. Concept tests have
traditionally been done with 5-point scales, with the accompanying lack of discrimination and
scale use bias. Anchored MaxDiff has proven to have superior discrimination, but the stability of
the anchor (the buy/no buy threshold) has been called into question in previous research
presented at the Sawtooth Software conference. Specifically, the context of how many concepts
are being evaluated within the direct anchoring approach can affect the absolute position of the
anchor. This would be extremely problematic for using the anchored MaxDiff approach to
compare the absolute desirability of concepts across multiple waves of research that involve
differing numbers and quality of product concepts. To reduce the context effect for direct anchor
questions, Rosanna and her co-authors used an Adaptive MaxDiff procedure to obtain a rough
rank-ordering of items for each respondent. Then, in real-time, they asked respondents binary
purchase intent questions for six items ranging along the continuum of preference from the
respondent’s best to the respondent’s worst items. They compared results across multiple waves
of data collection involving different numbers of product concepts. They found good consistency
across waves and that the MaxDiff approach led to greater discrimination among the top product
concepts than the ratings questions.
How Important Are the Obvious Comparisons in CBC? The Impact of Removing Easy
Conjoint Tasks (Paul Johnson and Weston Hadlock, SSI): One well-known complaint about
CBC questionnaires is that they can often display obvious comparisons (dominated concepts)
within a choice task. Obvious comparisons are those in which the respondent recognizes that one
concept is logically inferior in every way to another. After encountering a conjoint
analysis study where a full 60% of experimentally designed choice tasks included a logically
dominated concept, Paul and Weston decided to experiment on the effect of removing dominated
concepts. They fielded that same study among 500 respondents, where half the sample received
the typically designed CBC tasks and the other half received CBC tasks wherein the authors
removed any tasks including dominated concepts, replacing them with tasks without dominated
concepts (by modifying the design file in Excel). They found few differences between the two
groups in terms of predictability of holdout tasks or length of time to complete the CBC
questionnaire. They asked some follow-up qualitative questions regarding the survey-taking
experience and found no significant differences between the two groups of respondents. Paul and
Weston concluded that if it requires extra effort on the part of the researcher to modify the
experimental design to avoid dominated concepts, then it probably isn’t worth the extra effort in
terms of quality of the results or respondent experience.
Segmenting Choice and Non-Choice Data Simultaneously (Thomas C. Eagle, Eagle
Analytics of California): This presentation focused on how to leverage both choice data (such as
CBC or MaxDiff) and non-choice data (other covariates, whether nominal or continuous) to
develop effective segmentation solutions. Tom compared and contrasted two common
approaches: a) first estimating individual-level utility scores using HB and then using those
scores plus non-choice data as basis variables within cluster analysis, or b) simultaneous utility
estimation leveraging choice and non-choice data using latent class procedures, specifically
Latent Gold software. Tom expressed that he worries about the two-step procedure (HB followed
by clustering), for at least two reasons: first, errors in the first stage are taken as given in the
second stage; and second, HB involves prior assumptions of population normality, leading to at
least some degree of Bayesian smoothing to the mean—which is at odds with the notion of
forming distinct segments. Using simulated data sets with known segmentation structure, Tom
compared the two approaches. Using the two-stage approach leads to the additional complication
of needing to somehow normalize the scores for each respondent to try to remove the scale
confound. Also, HB prior settings affect the results, and it isn’t always clear to the researcher
which settings to invoke. Tom also found that whether
using the two-stage approach or the simultaneous one-stage approach, the BIC criterion often
failed to point to the correct number of segments. He commented that if a clear segmentation
exists (wide separation between groups and low response error), almost any approach will find it.
But, any segmentation algorithm will find patterns in data even if meaningful patterns do not
exist.
Extending Cluster Ensemble Analysis via Semi-Supervised Learning (Ewa Nowakowska,
GfK Custom Research North America and Joseph Retzer, CMI Research, Inc.): Ewa and
Joseph’s work focused on obtaining not only high quality segmentation results, but actionable
ones, where actionable is defined as having particular managerial relevance (such as
discriminating between intenders/non-intenders). They also reviewed the terminology of
unsupervised vs. supervised learning. Unsupervised learning involves discovering latent
segments in data using a series of basis variables (e.g., cluster algorithms). Supervised learning
involves classifying respondents into specific target outcomes (e.g., purchasers and non-purchasers) using methods such as logistic regression, CART, Neural Nets, and Random Forests. Semi-supervised learning combines aspects of supervised and unsupervised learning to find segments
that are of high quality (in terms of discrimination among basis variables) and actionable (in
terms of classifying respondents into categories of managerial interest). Ewa and Joe’s main tools
to do this were Random Forests (provided in R) and Sawtooth Software’s CCEA (Convergent
Cluster & Ensemble Analysis). The authors used the multiple solutions provided by Random
Forests to compute a respondent-by-respondent similarities matrix (based on how often
respondents ended up within the same terminal node). They employed hierarchical clustering
analysis to develop cluster solutions based on the similarities data. These cluster solutions were
combined with standard unsupervised cluster solutions (developed on the basis variables) to
create ensembles of segmentation solutions, which CCEA software in turn used to create a high
quality and actionable consensus cluster solution. Ewa and Joe wrapped it up by showing a web-based simulator that assigns respondents into segments based on responses to basis variables.
The Shapley Value in Marketing Research: 15 Years and Counting (Michael Conklin and
Stan Lipovetsky, GfK): Michael (supported by co-author Stan) explained that the Shapley Value is not
only an extension of standard TURF analysis, but can also be applied to numerous other
marketing research problems. The Shapley Value derives from game theory. In simplest terms,
one can think about the value that a hockey player provides to a team (in terms of goals scored
by the team per minute) when this player is on the ice. For marketing research TURF problems,
the Shapley Value is the unique value contributed by a flavor or brand within a lineup when
considering all possible lineup combinations. As one possible extension, Michael described a
Shapley Value model to predict share of choice for SKUs on a shelf. Respondents indicate which
SKUs are in the consideration set and the Shapley Value is computed (across thousands of
possible competitive sets) for each SKU. This value considers the likelihood that the SKU is in
the consideration set and importantly the likelihood that the SKU is chosen within each set
(equal to 1/n for each respondent, where n is the number of items in the consideration set). The
benefits of this simple model of consumer behavior are that it can accommodate very large
product categories (many SKUs) and it is very inexpensive to implement. The drawback is that
each SKU must be a complete, fixed entity on its own (not involving varying attributes, such as
prices). As yet another field for application of Shapley Value, Michael spoke of its use in drivers
analysis (rather than OLS or other related techniques). However, Michael emphasized that he
thought the greatest opportunities for the Shapley Value in marketing research lie in product line
optimization problems.
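As a rough illustration of the simple choice rule described above, the sketch below (with made-up SKU names and consideration sets, and a function name of my own; it is not the authors' implementation) computes a predicted share of choice for one assortment by giving each respondent a choice probability of 1/n for each of the n considered SKUs that are offered. The full approach Michael described repeats this kind of calculation over thousands of candidate competitive sets to obtain a Shapley-style contribution per SKU.

```python
from collections import defaultdict

# Hypothetical consideration sets, one per respondent (illustrative only)
consideration_sets = [
    {"SKU_A", "SKU_B"},
    {"SKU_A"},
    {"SKU_B", "SKU_C", "SKU_D"},
    {"SKU_C"},
    {"SKU_A", "SKU_D"},
]

def share_of_choice(assortment, consideration_sets):
    """Each respondent splits choice equally (1/n) among the considered SKUs
    that are actually offered in the assortment; shares are averaged over respondents."""
    shares = defaultdict(float)
    for cset in consideration_sets:
        offered = cset & assortment
        if not offered:
            continue  # this respondent buys nothing from this assortment
        for sku in offered:
            shares[sku] += 1.0 / len(offered)
    n_respondents = len(consideration_sets)
    return {sku: total / n_respondents for sku, total in sorted(shares.items())}

print(share_of_choice({"SKU_A", "SKU_B", "SKU_C", "SKU_D"}, consideration_sets))
```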
Demonstrating the Need and Value for a Multi-objective Product Search (Scott Ferguson
and Garrett Foster, North Carolina State University): Scott and Garrett reviewed the typical steps
involved in optimization problems for conjoint analysis, including estimating respondent
preferences via a conjoint survey and gathering product feature costs. Usually, such optimization
tasks involve setting a single goal, such as optimization of share of preference, utility, revenue,
or profit. However, there is a set of solutions on an efficient frontier that represent optimal mixes
of multiple goals, such as profit and share of preference. For example, two solutions may be very
similar in terms of profit, but the slightly lower profit solution may provide a large gain in terms
of share of preference. A multi-objective search algorithm reports dozens or more results
(product line configurations) to managers along the efficient frontier (among multiple objectives)
for their consideration. Again, those near-optimal solutions represent different mixes of multiple
objectives (such as profit and share of preference). Scott’s application involved genetic
algorithms. More than two objectives might be considered, Scott elaborated, for instance profit,
share of preference, and likelihood to be purchased by a specific respondent demographic. One
of the keys to being able to explore the variety of near-optimal solutions, Scott emphasized, was
the use of software visualization tools.
A Simulation Based Evaluation of the Properties of Anchored MaxDiff: Strengths,
Limitations and Recommendations for Practice (Jake Lee, Maritz Research and Jeffrey P.
Dotson, Brigham Young University): Jake and Jeff conducted a series of simulation studies to
test the properties of three methods for anchored MaxDiff: direct binary, indirect (dual response),
and the status quo method. The direct approach involves asking respondents if each item is
preferred to (more important than) the anchor item (where the anchor item is typically a buy/no
buy threshold or important/not important threshold). The indirect dual-response method involves
asking (after each MaxDiff question) if all the items shown are important, all are not important,
or some are important. The status quo approach involves adding a new item to the item list that
indicates the status quo state (e.g., no change). Jake and Jeff’s simulation studies examined data
situations in which respondents were more or less consistent along with whether the anchor
position was at the extreme of the scale or near the middle. They concluded that under most
realistic conditions, all three methods work fine. However, they recommended that the status quo
method be avoided if all items are above or below the threshold. They also reported that if
respondent error was especially high, the direct method should be avoided (though this usually
cannot be known ahead of time).
Best-Worst CBC Conjoint Applied to School Choice: Separating Aspiration from
Aversion (Angelyn Fairchild, RTI International, Namika Sagara and Joel Huber, Duke
University): Most CBC research today asks respondents to select the best concept within each
set. Best-Worst CBC involves asking respondents to select both the best and worst concepts
within sets of at least three concepts. School choice involves both positive and negative reactions
to features, so it naturally would seem a good topic for employing best-worst CBC. Joel and his
co-authors fielded a study among 150 parents with children entering grades 6–11. They used a
gradual, systematic introduction of the attributes to respondents. Before beginning the CBC task,
they asked respondents to select the level within each attribute that best applied to their current
school; then, they used warm-up tradeoff questions that showed just a few attributes at a time
(partial-profile). When they compared results from best-only versus worst-only choices, they
consistently found smaller utility differences between the best two levels for best-only choices.
They also employed a rapid OLS-based utility estimation for on-the-fly estimation of utilities (to
provide real-time feedback to respondents). Although the simple method is not expected to
provide as accurate results as an HB model run on the entire dataset, the individual-level results
from the OLS estimation correlated quite strongly with HB results. They concluded that if the
decision to be studied involves both attraction and avoidance, then a Best-Worst CBC approach
is appropriate.
Does the Analysis of MaxDiff Data Require Separate Scaling Factors? (Jack Horne and
Bob Rayner, Market Strategies International): The traditional method of estimating scores for
MaxDiff experiments involves combining both best and worst choices and estimating as a single
multinomial logit model. Fundamental to this analysis is the assumption that the underlying
utility or preference dimension is the same whether respondents are indicating which items are
best or which are worst. It also assumes that response errors for selecting bests are equivalent to
errors when selecting worsts. However, empirical evidence suggests that neither the utility scale
nor the error variance is the same for best and worst choices in MaxDiff. Using simulated data,
Jack and his co-author Bob investigated to what degree incompatibilities in scale between best
and worst choices affects the final utility scores. They adjusted the scale of one set of choices
relative to the other by multiplying the design matrix for either best or worst choices by a
constant, prior to estimating final utility scores. The final utilities showed the same rank-order
before and after the correction, though the utilities did not lie perfectly on a 45-degree line when
the two sets were XY scatter plotted. Next, the authors turned to real data. They first measured
the scale of bests relative to worsts and next estimated a combined model with correction for
scale differences. Correcting for scale or not resulted in essentially the same holdout hit
rate under HB estimation. They concluded that although combining best and worst judgments
without an error scale correction biases the utilities, the resulting rank order of the items remains
unchanged, and the bias is likely too small to change any business decisions. Thus, the extra work is
probably not justified. As a side note, the authors suggested that comparing best-only and worst-only estimated utilities for each respondent is yet another way to identify and clean noisy
respondents.
Using Conjoint Analysis to Determine the Market Value of Product Features (Greg
Allenby, Ohio State University, Jeff Brazell, The Modellers, John Howell, Penn State University,
and Peter Rossi, University of California Los Angeles): The main thrust of this paper was to
outline a more defensible approach for using conjoint analysis to attach economic value to
specific features than is commonly used in many econometric applications and in intellectual
property litigation. Peter (and his co-authors) described how conjoint analysis is often used in
high profile lawsuits to assess damages. One of the most common approaches used by expert
witnesses is to take the difference in utility (for each respondent) between having and not having
the infringed-upon feature and divide it by the price slope. This, Peter argued, is fraught with
difficulties including a) certain respondents projected to pay astronomically high amounts for
features, and b) the approach ignores important competitive realities in the marketplace. Those
wanting to present evidence for high damages are prone to use conjoint analysis in this way
because it is relatively inexpensive to conduct and because the difference-in-utility-divided-by-price-slope
method of computing the economic value of features usually results in very large estimates of
damages. Peter and his co-authors argued that to assess the economic value of a feature to a firm
requires conducting market simulations (a share of preference analysis) involving a realistic set
of competitors, including the outside good (the “None” category). Furthermore, it requires a
game theoretic approach to compare the industry equilibrium prices with and without the alleged
patent infringement. This involves allowing each competitor to respond to the others via price
changes to maximize self-interest (typically, profit).
The Ballad of Best and Worst (Tatiana Dyachenko, Rebecca Walker Naylor, and Greg
Allenby, Ohio State University): Greg presented research completed primarily by the lead author,
Tatiana, at our conference (unfortunately, Tatiana was unable to attend). In that work, Tatiana
outlines two different perspectives regarding MaxDiff. Most current economic models for
MaxDiff assume that the utilities should be invariant to the elicitation procedure. However,
psychological theories would expect different elicitation modes to produce different utilities.
Tatiana conducted an empirical study (regarding concerns about hair health among 594 female
respondents, aged 50+) to test whether best and worst responses lead to different parameters, to
investigate elicitation order effects, and to build a comprehensive model that accounted for both
utility differences between best and worst answers as well as order effects (best answered first or
worst answered first). Using her model, she and her co-authors found significant terms
associated with elicitation order effects and the difference between bests and worsts. In their
data, respondents were more sure about the “worsts” than the “bests” (lower error variance
around worsts). Furthermore, they found that the second decision made by respondents was less
error prone. Tatiana and her co-authors recommend that researchers consider which mode of
thinking is most appropriate for the business decision in question: maximizing best aspects or
minimizing worst aspects. Since the utilities differ depending on the focus on bests or worsts, the
two are not simply interchangeable. But, if researchers decide to ask both bests and worsts, they
recommend analyzing the data using a model such as theirs that can account for differences between
bests and worsts as well as for elicitation order effects.
9 THINGS CLIENTS GET WRONG ABOUT CONJOINT ANALYSIS
CHRIS CHAPMAN1
GOOGLE
ABSTRACT
This paper reflects on observations from over 100 conjoint analysis projects across the
industry and multiple companies that I have observed, conducted, or informed. I suggest that
clients often misunderstand the results of conjoint analysis (CA) and that the many successes of
CA may have created unrealistic expectations about what it can deliver in a single study. I
describe some common points of misunderstanding about preference share, feature assessment,
average utilities, and pricing. Then I suggest how we might make better use of distribution
information from Hierarchical Bayes (HB) estimation and how we might use multiple samples
and studies to inform client needs.
INTRODUCTION
Decades of results from the marketing research community demonstrate that conjoint
analysis (CA) is an effective tool to inform strategic and tactical marketing decisions. CA can be
used to gauge consumer interest in products and to inform estimates of feature interest, brand
equity, product demand, and price sensitivity. In many well-conducted studies, analysts have
demonstrated success using CA to predict market share and to determine strategic product line
needs.2
However, the successes of CA also raise clients’ expectations to levels that can be
excessively optimistic. CA is widely taught in MBA courses, and a new marketer in industry is
likely soon to encounter CA success stories and business questions where CA seems appropriate.
This is great news . . . if CA is practiced appropriately. The apparent ease of designing, fielding,
and analyzing a CA study presents many opportunities for analysts and clients to make mistakes.
In this paper, I describe some misunderstandings that I’ve observed in conducting and
consulting on more than 100 CA projects. Some of these come from projects I’ve fielded while
others have been observed in consultation with others; none is exemplary of any particular firm.
Rather, the set of cases reflects my observations of the field. For each one I describe the problem
and how I suggest rectifying it in clients’ understanding.
All data presented here are fictional. The data primarily concern an imaginary “designer USB
drive” that comprises nominal attributes such as size (e.g., Nano, Full-length), design style,
ordinal attributes of capacity (e.g., 32 GB), and price. The data were derived by designing a
choice-based conjoint analysis survey, having simulated respondents making choices, and
estimating the utilities using Hierarchical Bayes multinomial logit estimation. For full details,
refer to the source of the data: simulation and example code given in the R code “Rcbc”
(Chapman, Alford, and Ellis, 2013; available from this author).
The data here were not designed to illustrate problems; rather, they come from didactic R
code. It just happens that those data—like data in most CA projects—are misinterpretable in all
the common ways.

1 [email protected]
2 There are too many published successes for CA to list them comprehensively. For a start, see papers in this and other volumes of the Proceedings of the Sawtooth Software Conference. Published cases where this author contributed used CA to inform strategic analysis using game theory (Chapman & Love, 2012), to search for optimum product portfolios (Chapman & Alford, 2010), and to predict market share (Chapman, Alford, Johnson, Lahav, & Weidemann, 2009). This author also helped compile evidence of CA reliability and validity (Chapman, Alford, & Love, 2009).
MISTAKE #1: CONJOINT ANALYSIS DIRECTLY TELLS US HOW MANY PEOPLE WILL BUY
THIS PRODUCT
A simple client misunderstanding is that CA directly estimates how many consumers will
purchase a product. It is simple to use part-worth utilities to estimate preference share and
interpret this as “market share.” Table 1 demonstrates this using the multinomial logit formula
for aggregate share between two products. In practice, one might use individual-level utilities in
a market simulator such as Sawtooth Software SMRT, but the result is conceptually the same.
Table 1: Example Preference Share Calculation

                   Product 1   Product 2   Total
Sum of utilities   1.0         0.5         --
Exponentiated      2.72        1.65        4.37
Share of total     62%         38%
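For readers who want to check the arithmetic behind Table 1, here is a minimal sketch of the multinomial logit share calculation (Python is used purely for illustration; the two utility totals are the hypothetical values from the table, not output from any particular package):

```python
import math

# Hypothetical summed utilities for two product concepts (values from Table 1)
utilities = {"Product 1": 1.0, "Product 2": 0.5}

# Multinomial logit share of preference: exp(U_i) / sum_j exp(U_j)
exp_u = {name: math.exp(u) for name, u in utilities.items()}
total = sum(exp_u.values())

for name in utilities:
    share = exp_u[name] / total
    print(f"{name}: exp(U) = {exp_u[name]:.2f}, share of preference = {share:.0%}")

# Product 1: exp(U) = 2.72, share of preference = 62%
# Product 2: exp(U) = 1.65, share of preference = 38%
```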
As most research practitioners know but many clients don’t (or forget), the problem is this:
preference share is only partially indicative of real market results. Preference share is an
important input to a marketing model, yet is only one input among many. Analysts and clients
need to determine that the CA model is complete and appropriate (i.e., valid for the market) and
that other influences are modeled, such as awareness, promotion, channel effects, competitive
response, and perhaps most importantly, the impact of the outside good (in other words, that
customers could choose none of the above and spend money elsewhere).
I suspect this misunderstanding arises from three sources. First, clients very much want CA
to predict share! Second, CA is often given credit for predicting market share even when CA was
in fact just one part of a more complex model that mapped CA preference to the market. Third,
analysts’ standard practice is to talk about “market simulation” instead of “relative preference
simulation.”
Instead of claiming to predict market share, I tell clients this: conjoint analysis assesses how
many respondents prefer each product, relative to the tested alternatives. If we iterate studies,
know that we’re assessing the right things, calibrate to the market, and include other effects, we
will get progressively better estimates of the likely market response. CA is a fundamental part of
that, yet only one part. Yes, we can predict market share (sometimes)! But an isolated, single-shot CA is not likely to do so very well.
MISTAKE #2: CA ASSESSES HOW GOOD OR BAD A FEATURE (OR PRODUCT) IS
The second misunderstanding is similar to the first: clients often believe that the highest part-worth indicates a good feature while negative part-worths indicate bad ones. Of course, all
utilities really tell us is that, given the set of features and levels presented, this is the best fit to a
set of observed choices. Utilities don’t indicate absolute worth; inclusion of different levels
likely would change the utilities.
A related issue is that part-worths are relative within a single attribute. We can compare
levels of an attribute to one another—for instance, to say that one memory size is preferable to
another memory size—but should not directly compare the utilities of levels across attributes (for
instance, to say that some memory size level is more or less preferred than some level of color or
brand or processor). Ultimately, product preference involves full specification across multiple
attributes and is tested in a market simulator (I say more about that below).
I tell clients this: CA assesses tradeoffs among features to be more or less preferred. It does
not assess absolute worth or say anything about untested features.
MISTAKE #3: CA DIRECTLY TELLS US WHERE TO SET PRICES
Clients and analysts commonly select CA as a way to assess pricing. What is the right price?
How will such-and-such feature affect price? How price sensitive is our audience? All too often,
I’ve seen clients inspect the average part-worths for price—often estimated without constraints
and as piecewise utilities—and interpret them at face value.
Figure 1 shows three common patterns in price utilities; the dashed line shows scaling in
exact inverse proportion to price, while the solid line plots the preference that we might observe
from CA (assuming a linear function for patterns A and B, and a piecewise estimation in pattern
C, although A and B could just as well be piecewise functions that are monotonically
decreasing).
In pattern A, estimated preference share declines more slowly than price (or log price)
increases. Clients love this: the implication is to price at the maximum (presumably not to
infinity). Unfortunately, real markets rarely work that way; this pattern more likely reflects a
method effect where CA underestimates price elasticity.
Figure 1: Common Patterns in Price Utilities (A: Inelastic demand; B: Elastic demand; C: Curved demand)
In pattern B, the implication is to price at the minimum. The problem here is that relative
preference implies range dependency. This may simply reflect the price range tested, or reflect
that respondents are using the survey for communication purposes (“price low!”) rather than to
express product preferences.
Pattern C seems to say that some respondents like low prices while others prefer high prices.
Clients love this, too! They often ask, “How do we reach the price-insensitive customers?” The
problem is that there is no good theory as to why price should show such an effect. It is more
likely that the CA task was poorly designed or confusing, or that respondents had different goals
such as picking their favorite brand or heuristically simplifying the task in order to complete it
quickly. Observation of a price reversal as we see here (i.e., preference going up as price goes up
in some part of the curve) is more likely an indication of a problem than an observation about
actual respondent preference!
If pattern C truly does reflect a mixture of populations (elastic and inelastic respondents) then
there are higher-order questions about the sample validity and the appropriateness of using
pooled data to estimate a single model. In short: pattern C is seductive! Don’t believe it unless
you have assessed carefully and ruled out the confounds and the more theoretically sound
constrained (declining) price utilities.
What I tell clients about price is: CA provides insight into stated price sensitivity, not exact
price points or demand estimates without a lot more work and careful consideration of models,
potentially including assessments that attempt more realistic incentives, such as incentive-aligned conjoint analysis (Ding, 2007). When assessing price, it’s advantageous to use multiple
methods and/or studies to confirm that answers are consistent.
MISTAKE #4: THE AVERAGE UTILITY IS THE BEST MEASURE OF INTEREST
I often see—and yes, sometimes even produce—client deliverables with tables or charts of
“average utilities” by level. This unfortunately reinforces a common cognitive error: that the
average is the best estimate. Mathematically, of course, the mean of a distribution minimizes
some kinds of residuals—but that is rarely how a client interprets an average!
Consider Table 2. Clients interpret this as saying that Black is a much better feature than Tie-dye. Sophisticated ones might ask whether it is statistically significant ("yes") or compute the
preference share for Black (84%). None of that answers the real question: which is better for the
decision at hand?
Table 2: Average Feature Utilities

Feature     Average Utility
Black             0.79
Tie-dye          -0.85
...                ...
Figure 3 is what I prefer to show clients and presents a very different picture. In examining
Black vs. Tie-dye, we see that the individual-level estimates for Black have low variance while
Tie-dye has high variance. Black is broadly acceptable, relative to other choices, while Tie-dye is
polarizing.
Is one better? That depends on the goal. If we can only make a single product, we might
choose Black. If we want a diverse portfolio with differently appealing products, Tie-dye might
fit. If we have a way to reach respondents directly, then Silver might be appealing because a few
people strongly prefer it. Ultimately this decision should be made on the basis of market
simulation (more on that below), yet understanding the preference structure more fully may help
an analyst understand the market and generate hypotheses that otherwise might be overlooked.
Figure 3: Distribution of Individual-Level Utilities from HB Estimation
The client takeaway is this: CA (using HB) gives us a lot more information than just average
utility. We should use that information to have a much better understanding of the distribution of
preference.
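For readers who want to try this, here is a minimal sketch in R using simulated individual-level utilities; the means and variances are invented to mimic the Black vs. Tie-dye pattern and do not come from any actual study.

```r
# Hypothetical individual-level (HB point-estimate) utilities for two color levels
set.seed(42)
n <- 500
utils <- data.frame(
  Black   = rnorm(n, mean = 0.79, sd = 0.4),   # broadly acceptable: low variance (assumed)
  Tie_dye = rnorm(n, mean = -0.85, sd = 1.8)   # polarizing: high variance (assumed)
)

colMeans(utils)                      # the averages alone (cf. Table 2) hide the structure

data.frame(                          # spread and head-to-head preference add the story
  mean       = round(colMeans(utils), 2),
  sd         = round(apply(utils, 2, sd), 2),
  pct_prefer = round(c(mean(utils$Black > utils$Tie_dye),
                       mean(utils$Tie_dye > utils$Black)), 2)
)

boxplot(utils, ylab = "Individual-level utility")   # quick look at the distributions
```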
MISTAKE #5: THERE IS A TRUE SCORE
The issue about average utility (problem #4 above) also arises at the individual level.
Consider Figure 4, which presents the mean betas for one respondent. This respondent has low
utilities for features 6 and 10 (on the X axis) and high utilities for features 2, 5, and 9.
It is appealing to think that we have a psychic X-ray of this respondent, that there is some
“true score” underlying these preferences, as a social scientist might say. There are several
problems with this view. One is that behavior is contextually dependent, so any respondent might
very well behave differently at another time or in another context (such as a store instead of a
survey). Yet even within the context of a CA study, there is another issue: we know much more
about the respondent than the average utility!
Figure 4: Average Utility by Feature, for One Respondent
Now compare Figure 5 with Figure 4. Figure 5 shows—for the same respondent—the within-respondent distribution of utility estimates across 100 draws of HB estimates (using Markov chain Monte Carlo, or MCMC, estimation). We see significant heterogeneity. An 80% or 95% credible
interval on the estimates would find few “significant” differences for this respondent. This is a
more robust picture of the respondent, and inclines us away from thinking of him or her as a
“type.”
Figure 5: Distribution of HB Beta Estimates by Feature, for the Same Respondent
What I tell clients is this: understand respondents in terms of tendency rather than type.
Customers behave differently in different contexts and there is uncertainty in CA assessment.
The significance of that fact depends on our decisions, business goals, and ability to reach
customers.
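A minimal R sketch of the credible-interval idea, using simulated draws in place of real MCMC output; the respondent and the draws are fabricated purely for illustration.

```r
# Simulated stand-in for 100 saved HB draws (rows) x 10 feature betas (columns)
# for a single respondent; real draws would come from the MCMC output files.
set.seed(1)
n_draws    <- 100
n_features <- 10
tendencies <- rnorm(n_features, 0, 1)                          # assumed mean betas
draws <- sapply(tendencies, function(b) rnorm(n_draws, b, 1))  # within-respondent uncertainty
colnames(draws) <- paste0("feature_", seq_len(n_features))

# 80% and 95% credible intervals per feature for this respondent
round(t(apply(draws, 2, quantile, probs = c(0.025, 0.10, 0.50, 0.90, 0.975))), 2)

# Features whose intervals overlap heavily should not be read as clearly different.
```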
MISTAKE #6: CA TELLS US THE BEST PRODUCT TO MAKE (RATHER EASILY)
Some clients and analysts realize that CA can be used not only to assess preference share and
price sensitivity but also to inform a product portfolio. In other words, to answer “What should
we make?”
An almost certainly wrong answer would be to make the product with highest utility, because
it is unlikely that the most desirable features would be paired with the best brand and lowest
price. A more sophisticated answer searches for preference tradeoff vs. cost in the context of a
competitive set. However, this method capitalizes on error and precise specification of the
competitive sets; it does not examine the sensitivity and generality of the result.
Better results may come by searching for a large set of near-optimum products and examine
their commonalities (Chapman and Alford, 2010; cf. Belloni et al., 2008). Another approach,
depending on the business question, would be to examine likely competitive response to a
decision using a strategic modeling approach (Chapman and Love, 2012). An analyst could
combine the approaches: investigate a set of many potential near-optimal products, choose a set
of products that is feasible, and then investigate how competition might respond to that line.
Doing this is a complex process: it requires extraordinarily high confidence in one’s data, and
then one must address crucial model assumptions and adapt (or develop) custom code in R or
some other language to estimate the models (Chapman and Alford, 2010; Chapman and Love,
2012). The results can be extremely informative—for instance, a product identified in Chapman
and Alford (2010) was identified by the model fully 17 months in advance of its introduction to
the market by a competitor—but arriving at such an outcome is a complex undertaking built on
impeccable data (and perhaps luck).
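To make the idea concrete, here is a toy sketch of the "many near-optimal products" search using invented part-worths, a simple logit share rule, and exhaustive evaluation rather than the genetic algorithm of Chapman and Alford (2010); it illustrates the concept, not the method used in those papers.

```r
set.seed(7)
# Invented part-worths: 3 attributes x 3 levels, for 300 simulated respondents
n_resp <- 300
pw <- replicate(9, rnorm(n_resp, 0, 1))   # columns 1-3 = attr A, 4-6 = B, 7-9 = C

utility <- function(cfg) pw[, cfg[1]] + pw[, 3 + cfg[2]] + pw[, 6 + cfg[3]]

comp <- utility(c(1, 1, 1))               # one fixed, hypothetical competitor

share <- function(cfg) {                  # logit share of preference vs. the competitor
  u <- utility(cfg)
  mean(exp(u) / (exp(u) + exp(comp)))
}

# Evaluate every candidate (27 here; use random or genetic search in large spaces)
grid <- expand.grid(A = 1:3, B = 1:3, C = 1:3)
grid$share <- apply(grid[, c("A", "B", "C")], 1, share)

# Keep the near-optimal set and inspect its commonalities
near_opt <- grid[grid$share >= 0.95 * max(grid$share), ]
print(near_opt)
table(near_opt$A)   # does attribute A settle on one level across near-optima?
```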
In short, when clients wish to find the “best product,” I explain: CA informs us about our
line, but precise optimization requires more models, data, and expertise.
MISTAKE #7: GET AS MUCH STATISTICAL POWER (SAMPLE) AS POSSIBLE
This issue is not specific to CA but to research in general. Too many clients (and analysts) are
impressed with sample size and automatically assume that more sample is better.
Figure 6 shows the schematic of a choice-based conjoint analysis (CBC) study I once
observed. The analyst had a complex model with limited sample and wanted to obtain adequate
power. Each CBC task presented 3 products and a None option . . . and respondents were asked
to complete 60 such tasks!
Figure 6: A Conjoint Analysis Study with Great “Power”
Power is directly related to confidence intervals, and the problem with confidence intervals
(in classical statistics) is that they scale to the inverse square root of sample size. When you
double the sample size, you only reduce the confidence interval by about 29% (1 - 1/√2). To cut the
confidence interval in half requires 4x the sample size. This has two problems: diminishing
returns, and lack of robustness to sample misspecification. If your sample is a non-probability
sample, as most are, then sampling more of it may not be the best approach.
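The square-root law is easy to verify numerically; the lines below use only the textbook formula for a proportion's confidence interval, not study data.

```r
# Half-width of a 95% confidence interval for a proportion near 0.5, by sample size
n  <- c(250, 500, 1000, 2000, 4000)
hw <- 1.96 * sqrt(0.5 * 0.5 / n)
data.frame(n, half_width = round(hw, 4), relative_to_n250 = round(hw / hw[1], 2))
# Doubling n shrinks the interval by only about 29% (1 - 1/sqrt(2));
# halving it requires 4x the sample.
```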
I prefer instead to approach sample size this way: determine the minimum sample needed to
give an adequate business answer, and then split the available sampling resources into multiple
chunks of that size, assessing each one with varying methods and/or sampling techniques. We
can have much higher confidence when findings come from multiple samples using multiple
methods.
What I tell clients: instead of worrying about more and more statistical significance, we
should maximize interpretative power and minimize risk. I sketch what such multiple
assessments might look like. “Would you rather have: (1) Study A with N=10000, or (2) Study A
with 1200, Study B with 300, Study C with 200, and Study D with 800?” Good clients
understand immediately that despite having ¼ the sample, Plan 2 may be much more
informative!
MISTAKE #8: MAKE CA FIT WHAT YOU WANT TO KNOW
To address tough business questions, it’s a good idea to collect customer data with a method
like CA. Unfortunately, this may yield surveys that are more meaningful to the client than the
respondent.
I find this often occurs with complex technical features (that customers may not understand)
and messaging statements (that may not influence CA survey behavior). Figure 7 presents a
fictional CBC task about wine preferences. It was inspired by a poorly designed survey I once
took about home improvement products; I selected wine as the example because it makes the
issue particularly obvious.
Figure 7: A CBC about Wine
Imagine you are selecting a bottle of wine for a special celebration dinner at home.
If the following wines were your only available choices, which would you purchase?
[Figure shows a two-concept CBC task with attributes and levels such as: Blend (75% Cabernet Sauvignon, 20% Merlot, 4% Cabernet Franc, 1% Malbec vs. 75% Cabernet Sauvignon, 15% Merlot, 10% Cabernet Franc); Winery type (Custom crush vs. Negotiant); Bottle size (700ml vs. 750ml); Cork type (Grade 2 vs. Double disk (1+1)); Fining agent (None, unfined vs. Potassium caseinate); Bottling line type (Mobile vs. On premises); and Origin of bottle glass (Mexico vs. China).]
Our fictional marketing manager is hoping to answer questions like these: should we fine our
wines (cause them to precipitate sediment before bottling)? Can we consider cheaper bottle
sources? Should we invest in an in-house bottling line (instead of truck that moves between
facilities)? Can we increase the Cabernet Franc in our blend (for various possible reasons)? And
so forth.
Those are all important questions but posing their technical features to customers results in a
survey that only a winemaker could answer! A better survey would map the business
consideration to features that a consumer can address, such as taste, appearance, aging potential,
cost, and critics’ scores. (I leave the question of how to design that survey about wine as an
exercise for the reader.)
This example is extreme, yet how often do we commit similar mistakes in areas where we are
too close to the business? How often do we test something “just to see if it has an effect?” How
often do we describe something the way that R&D wants? Or include a message that has little if
any real information? And then, when we see a null effect, are we sure that it is because
customers don’t care, or could it be because the task was bad? (A similar question may be asked
in case of significant effects.) And, perhaps most dangerously, how often do we field a CA
without doing a small-sample pretest?
The implication is obvious: design CA tasks to match what respondents can answer reliably
and validly. And before fielding, pretest the attributes, levels, and tasks to make sure!
(NON!-) MISTAKE #9: IT’S BETTER THAN USING OUR INSTINCTS
Clients, stakeholders, managers, and sometimes even analysts are known to say, “Those
results are interesting but I just don’t believe them!” Then an opinion is substituted for the data.
Of course CA is not perfect—all of the above points demonstrate ways in which it may go
wrong, and there are many more—but I would wager this: a well-designed, well-fielded CA is
almost always better than expert opinion. Opinions of those close to a product are often
dramatically incorrect (cf. Gourville, 2004). Unless you have better and more reliable data that
contradicts a CA, go with the CA.
If we consider this question in terms of expected payoff, I propose that the situation
resembles Figure 8. If we use data, our estimates are likely to be closer to the truth than if we
don’t. Sometimes they will be wrong, but will not be as wrong on average as opinion would be.
Figure 8: Expected Payoffs with and without Data

                Decision correct              Decision incorrect             Net expectation
Use data        High precision (high gain)    Low inaccuracy (modest loss)   Positive
Use instinct    Low precision (modest gain)   High inaccuracy (large loss)   Negative
When we get a decision right with data, the relative payoff is much larger. Opinion is
sometimes right, but likely to be imprecise; when it is wrong, expert opinion may be disastrously
wrong. On the other hand, I have yet to observe a case where consumer data has been terribly
misleading; the worst case I’ve seen is when it signals a need to learn more. When opinion and
data disagree, explore more. Do a different study, with a different method and different sampling.
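The expected-payoff argument can be made concrete with a back-of-the-envelope calculation; every probability and payoff below is invented purely to illustrate the structure of Figure 8.

```r
# All probabilities and payoffs below are invented; they only mirror the structure
# of Figure 8 (data: often right, modestly wrong; instinct: less often right, badly wrong).
p_right <- c(data = 0.80, instinct = 0.55)
gain    <- c(data = 10,   instinct = 6)     # payoff when the decision is correct
loss    <- c(data = -3,   instinct = -15)   # payoff when the decision is incorrect

p_right * gain + (1 - p_right) * loss       # expected payoff: positive vs. negative
```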
What I tell clients: it’s very risky to bet against what your customers are telling you! An
occasional success—or an excessively successful single opiner—does not disprove the value of
data.
MISTAKE #10 AND COUNTING
Keith Chrzan (2013) commented on this paper after presentation at the Sawtooth Software
Conference and noted that attribute importance is another area where there is widespread
confusion. Clients often want to know “Which attributes are most important?” but CA can only
answer this with regard to the relative utilities of the attributes and features tested. Including (or
omitting) a very popular or unpopular level on one attribute will alter the “importance” of every
other attribute!
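A small sketch of the usual range-based importance calculation makes the point; the utilities are made up, and the only claim is that adding one extreme level to a single attribute shifts every attribute's "importance."

```r
# Made-up average utilities for three attributes
utils <- list(
  brand = c(0.4, 0.1, -0.5),
  color = c(0.3, -0.3),
  price = c(0.8, 0.2, -1.0)
)

importance <- function(u) {
  ranges <- sapply(u, function(x) diff(range(x)))
  round(100 * ranges / sum(ranges), 1)      # range-based "importance" in percent
}

importance(utils)

utils$color <- c(utils$color, -2.0)         # add one very unpopular color level
importance(utils)                           # every attribute's "importance" shifts
```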
CONCLUSION
Conjoint analysis is a powerful tool but its power and success also create conditions where
client expectations may be too high. We’ve seen that some of the simplest ways to view CA
results such as average utilities may be misleading, and that despite client enthusiasm they may
distract from answering more precise business questions. The best way to meet high expectations
is to meet them! This may require all of us to be more careful in our communications, analyses,
and presentations.
The issues here are not principally technical in nature; rather they are about how conjoint
analysis is positioned and how expectations are set and upheld through effective study design,
analysis, and interpretation. I hope the paper inspires you—and even better, inspires and informs
clients.
Chris Chapman
ACKNOWLEDGEMENTS
I’d like to thank Bryan Orme, who provided careful, thoughtful, and very helpful feedback at
several points to improve both this paper and the conference presentation. If this paper is useful
to the reader, that is in large part due to Bryan’s suggestions (and if it’s not useful, that’s due to
the author!) Keith Chrzan also provided thoughtful observations and reflections during the conference. Finally, I'd like to thank all my colleagues over the years, many of whom are reflected in the reference list. They spurred these reflections more than anything I did.
REFERENCES
Belloni, A., Freund, R.M, Selove, M., and Simester, D. (2008). Optimal product line design:
efficient methods and comparisons. Management Science 54: 9, September 2008, pp. 1544–
1552.
Chapman, C.N., Alford, J.L., and Ellis, S. (2013). Rcbc: marketing research tools for choice-based conjoint analysis, version 0.201. [R code]
Chapman, C.N., and Love, E. (2012). Game theory and conjoint analysis: using choice data for
strategic decisions. Proceedings of the 2012 Sawtooth Software Conference, Orlando, FL,
March 2012.
Chapman, C.N., and Alford, J.L. (2010). Product portfolio evaluation using choice modeling and
genetic algorithms. Proceedings of the 2010 Sawtooth Software Conference, Newport Beach,
CA, October 2010.
Chapman, C.N., Alford, J.L., Johnson, C., Lahav, M., and Weidemann, R. (2009). Comparing
results of CBC and ACBC with real product selection. Proceedings of the 2009 Sawtooth
Software Conference, Del Ray Beach, FL, March 2009.
Chapman, C.N., Alford, J.L., and Love, E. (2009). Exploring the reliability and validity of
conjoint analysis studies. Presented at Advanced Research Techniques Forum (A/R/T
Forum), Whistler, BC, June 2009.
Chrzan, K. (2013). Remarks on "9 things clients get wrong about conjoint analysis." Discussion at
the 2013 Sawtooth Software Conference, Dana Point, CA, October 2013.
Ding, M. (2007). An incentive-aligned mechanism for conjoint analysis. Journal of Marketing
Research, 2007, pp. 214–223.
Gourville, J. (2004). Why customers don’t buy: the psychology of new product adoption. Case
study series, paper 9-504-056. Harvard Business School, Boston, MA.
QUANTITATIVE MARKETING RESEARCH SOLUTIONS IN A
TRADITIONAL MANUFACTURING FIRM:
UPDATE AND CASE STUDY
ROBERT J. GOODWIN
LIFETIME PRODUCTS, INC.
ABSTRACT
Lifetime Products, Inc., a manufacturer of folding furniture and other consumer hard goods,
provides a progress report on its quest for more effective analytic methods and offers an
insightful new ACBC case study. This demonstration of a typical adaptive choice study,
enhanced by an experiment with conjoint analysis design parameters, is intended to be of interest
to new practitioners and experienced users alike.
INTRODUCTION
Lifetime Products, Inc. is a privately held, vertically integrated manufacturing company
headquartered in Clearfield, Utah. The company manufactures consumer hard goods typically
constructed of blow-molded polyethylene resin and powder-coated steel. Its products are sold to
consumers and businesses worldwide, primarily through a wide range of discount and
department stores, home improvement centers, warehouse clubs, sporting goods stores, and other
retail and online outlets.
Over the past seven years, the Lifetime Marketing Research Department has adopted
progressively more sophisticated conjoint analysis and other quantitative marketing research
tools to better inform product development and marketing decision-making. The company’s
experiences in adopting and cost-effectively utilizing these sophisticated analytic methods—
culminating in its current use of Sawtooth Software’s Adaptive Choice-Based Conjoint (ACBC)
software—were documented in papers presented at previous Sawtooth Software Conferences
(Goodwin 2009, and Goodwin 2010).
In this paper, we first provide an update on what Lifetime Products has learned about
conjoint analysis and potential best practices thereof over the past three years. Then, for
demonstration purposes, we present a new Adaptive CBC case on outdoor storage sheds. The
paper concludes with a discussion of our experimentation with ACBC design parameters in this
shed study.
I. WHAT WE’VE LEARNED ABOUT CONJOINT ANALYSIS
This section provides some practical advice, intended primarily for new and corporate
practitioners of conjoint analysis and other quantitative marketing tools. This is based on our
experience at Lifetime Products as a “formerly new” corporate practitioner of conjoint analysis.
#1. Use Prudence in Conjoint Analysis Design
One of the things that helped drive our adoption of Sawtooth Software's Adaptive Choice-Based Conjoint program was its ability to administer conjoint analysis designs with large
numbers of attributes, without overburdening respondents. The Concept Screening phase of the
ACBC protocol allows each panelist to create a “short list” of potentially acceptable concepts
using whatever decision-simplification techniques s/he wishes, electing to downplay or even
ignore attributes they considered less essential to the purchase decision.
Further, we could allow them to select a subset of the most important attributes for inclusion
(or, alternatively, the least important attributes for exclusion) for the rest of the conjoint
experiment. Figure 1 shows an example page from our first ACBC study on Storage Sheds in
2008. Note that, in the responses entered, the respondent has selected eight attributes to
include—along with price and materials of construction, which were crucial elements in the
experiment—while implicitly excluding the other six attributes from further consideration.
Constructed lists could then be used to bring forward only the Top-10 attributes from an original
pool of 16 attributes, making the exercise much more manageable for the respondent. While the
part-worths for an excluded attribute would be zero for that observation, we would still capture
the relevant utility of that attribute for another panelist who retained it for further consideration
in the purchase decision.
Figure 1
Example of Large-scale ACBC Design
STORAGE SHEDS
We utilized this “winnowing” feature of ACBC for several other complex-design studies in
the year or two following its adoption at Lifetime. During the presentation of those studies to our
internal clients, we noticed a few interesting behaviors. One was the virtual fixation of a few
clients on a “pet” feature that (to their dismay) registered minimal decisional importance
following Hierarchical Bayes (HB) estimation. While paying very little attention to the most
important attributes in the experiment, they spent considerable time trying to modify the attribute
to improve its role in the purchase decision. In essence, this diverted their attention from what
mattered most in the consumers’ decision to what mattered least.
The more common client behavior was what could be called a “reality-check” effect. Once
the clients realized (and accepted) the minimal impact of such an attribute on purchase decisions,
they immediately began to concentrate on the more important array of attributes. Therefore,
when it came time to do another (similar) conjoint study, they were less eager to load up the
design with every conceivable attribute that might affect purchase likelihood.
Since that time, we have tended not to load up a conjoint study with large numbers of
attributes and levels, just because “it’s possible.” Instead, we have sought designs that are more
parsimonious by eliminating attributes and levels that we already know to be less important in
consumers’ decision-making. As a result, most of our recent studies have gravitated around
designs of 8–10 attributes and 20–30 levels.
Occasionally, we have found it useful to assess less-important attributes—or those that might
be more difficult to measure in a conjoint instrument—by testing them in regular questions
following the end of the conjoint experiment. For example, Figure 2 shows a follow-up question
in our 2013 Shed conjoint analysis survey to gauge consumers’ preference for a shed that
emphasized ease of assembly (at the expense of strength) vis-à-vis a shed that emphasized
strength (at the expense of longer assembly times). (This issue is relevant to Lifetime Products,
since our sheds have relatively large quantities of screws—making for longer assembly times—
but are stronger than most competitors’ sheds.)
Figure 2
Example of Post-Conjoint Preference Question
STORAGE SHEDS
#2. Spend Time to Refine Conjoint Analysis Instruments
Given the importance of the respondent being able to understand product features and
attributes, we have found it useful to spend extra time on the front end to ensure that the survey
instrument and conjoint analysis design will yield high-quality results. In a previous paper
(Goodwin, 2009), we reported the value of involving clients in instrument testing and debugging.
In a more general sense, we continue to review our conjoint analysis instruments and designs
with multiple iterations of client critique and feedback.
As we do so, we look out for several potential issues that could degrade the quality of
conjoint analysis results. First, we wordsmith attribute and level descriptions to maximize clarity.
For example, with some of our categories, we have found a general lack of understanding in the
marketplace regarding some attributes (such as basketball height-adjustment mechanisms and
backboard materials; shed wall, roof and floor materials; etc.). Attributes such as these
necessitate great care to employ verbiage that is understandable to consumers.
Another area we look out for involves close substitutes among levels of a given attribute,
where differences might be difficult for consumers to perceive, even in the actual retail
environment. For example, most mid-range basketball goals have backboard widths between 48
and 54 inches, in 2-inch increments. While most consumers can differentiate well between
backboards at opposite ends of this range, they frequently have difficulty deciding—or even
differentiating—among backboard sizes with 2-inch size differences. Recent qualitative research
with basketball system owners has shown that, even while looking at 50-inch and 52-inch
models side-by-side in a store, it is sometimes difficult for them (without looking at the product
labeling) to tell which one is larger than the other. While our effort is not to force product
discrimination in a survey where it may not exist that strongly in the marketplace itself, we want
to ensure that panelists are given a realistic set of options to choose from (i.e., so the survey
instrument is not the “problem”). Frequently, this means adding pictures or large labels showing
product size or feature call-outs to mimic in-store shopping as much as possible.
#3. Be More Judicious with Brand Names
Lifetime Products is not a household name like Coke, Ford, Apple, and McDonald’s. As a
brand sold primarily through big-box retailers, Lifetime Products is well known among the
category buyers who put our product on the shelf, but less so among consumers who take it off
the shelf. In many of our categories (such as tables & chairs, basketball, and sheds), the
assortment of brands in a given store is limited. Consequently, consumers tend to trust the
retailer to be the brand “gatekeeper” and to carry only the best and most reliable brands. In doing
so, they often rely less on their own brand perceptions and experiences.
Lifetime Products’ brand image is also confounded by misconceptions regarding other
entities such as the Lifetime Movie Channel, Lifetime Fitness Equipment, and even “lifetime”
warranty. There are also perceptual anomalies among competitor brands. For example,
Samsonite (folding tables) gets a boost from their well-known luggage line, Cosco (folding
chairs) is sometimes mistaken for the Costco store brand, and Rubbermaid (storage sheds) has a
halo effect from the wide array of Rubbermaid household products. Further, Lifetime kayaks
participate in a market that is highly fragmented with more than two dozen small brands, few of
which have significant brand awareness.
As a result, many conjoint analysis studies we have done produce flat brand utility profiles,
accompanied by low average-attribute-importance scores. This is especially the case when we
include large numbers of brand names in the exercise. Many of these brands end up with utility
scores lower than the “no brand” option, despite being well regarded by retail chain store buyers.
Because of these somewhat-unique circumstances in our business, Lifetime often uses
heavily abridged brand lists in its conjoint studies, or in some cases drops the brand attribute
altogether. In addition, in our most recent kayak industry study (with its plethora of unknown
brands), we had to resort to surrogate descriptions such as “a brand I do not know,” “a brand I
know to be good,” and so forth.
#4. Use Simulations to Estimate Brand Equity
Despite the foregoing, there are exceptions (most notably in the Tables & Chairs category)
where the Lifetime brand is relatively well known and has a long sales history among a few key
retailers. In this case, our brand conjoint results are more realistic, and we often find good
perceptual differentiation among key brand names, including Lifetime.
Lifetime sales managers often experience price resistance from retail buyers, particularly in
the face of new, lower-price competition from virtually unknown brands (in essence, “no
brand”). In instances like these, it is often beneficial to arm these sales managers with statistical
evidence of the value of the Lifetime brand as part of its overall product offering.
Recently, we generated such a brand equity analysis for folding utility tables using a reliable
conjoint study conducted a few years ago. In this context, we defined “per-unit brand equity” as:
The price reduction a “no-name” brand would have to use in order to replace the
Lifetime brand and maintain Lifetime’s market penetration.
The procedure we used for this brand equity estimation was as follows:
1. Generate a standard share of preference simulation, with the Lifetime table at its
manufacturer’s suggested retail price (MSRP), two competitor offerings at their
respective MSRPs, and the “None” option. (See left pie chart in Figure 3.)
2. Re-run the simulation using “no brand name” in place of the Lifetime brand, with no
other changes in product specifications (i.e., an exact duplicate offering except for the
brand name). The resulting share of preference for the “no-name” offering (which
otherwise duplicated the Lifetime attributes) decreased from the base-case share. (Note
that much of that preference degradation went to the “None” option, not to the other
competitors, suggesting possible strength of the Lifetime brand over the existing
competitors as well.)
3. Gradually decrease the price of the “no-name” offering until its share of preference
matched the original base case for the Lifetime offering. In this case, the price differential
was about -6%, which represents a reasonable estimate of the value of the Lifetime brand,
ceteris paribus. In other words, a no-name competitor with the same specification as the
Lifetime table would have to reduce its price 6% in order to maintain the same share of
preference as the Lifetime table. (See right pie chart in Figure 3.)
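For readers who want to experiment with this logic outside the Sawtooth simulator, here is a minimal sketch of step 3 using invented respondent utilities and a simple logit share-of-preference rule; it illustrates the search for the share-matching price cut, not the actual Lifetime analysis.

```r
set.seed(3)
n_resp <- 400

# Invented respondent-level inputs (not the study's HB utilities)
brand_lift  <- rnorm(n_resp, 0.6, 0.5)              # assumed Lifetime-vs-no-name utility
price_slope <- -rlnorm(n_resp, log(2), 0.3)         # negative utility per unit of price change
base_util   <- matrix(rnorm(n_resp * 3), ncol = 3)  # our table, Brand X, Brand Y

# Share of preference for "our" table given its brand utility and a fractional price change
share_ours <- function(brand_util, price_change) {
  u <- cbind(base_util[, 1] + brand_util + price_slope * price_change,
             base_util[, 2], base_util[, 3], 0)     # last column = "None"
  mean(exp(u[, 1]) / rowSums(exp(u)))
}

target <- share_ours(brand_lift, 0)   # base case: Lifetime brand at MSRP

# Steps 2-3: drop the brand, then search for the price cut that restores the share
f <- function(pct) share_ours(0, pct) - target
price_cut <- uniroot(f, lower = -1, upper = 0)$root
round(price_cut, 3)   # negative root = required price cut as a fraction of MSRP
```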
Figure 3
Method to Estimate Lifetime Brand Value
[Figure shows two share-of-preference pie charts under the title "Price Difference for Lifetime vs. No-Brand: Estimate of Lifetime Brand Value." The left chart, "Shares of Preference When Lifetime Brand Available," includes the Lifetime Table @ Retail Price, Brand X Table, Brand Y Table, and "Would Not Buy Any of These." The right chart, "Shares of Preference When Replaced by 'No-Name'," includes the No-Brand Table @ 6% Lower Price, Brand X Table, Brand Y Table, and "Would Not Buy Any of These," with the callout "6% Price Reduction Needed to Garner the Same Share as Lifetime."]
The 6% brand value may not seem high, compared with the perceived value of more well-known consumer brands. Nevertheless, in the case of Lifetime tables, this information was very
helpful for our sales managers responding to queries from retail accounts to help justify higher
wholesale prices than the competition.
#5. Improved Our Simulation Techniques
Over the past half-dozen years of using conjoint analysis, Lifetime has improved its use of
simulation techniques to help inform management decisions and sales approaches. In these
simulations, we generally have found it most useful to use the “None” option in order to capture
buy/no-buy behavior and to place relatively greater emphasis on share of preference among
likely buyers. Most importantly, this approach allows us to measure the possible expansion (or
contraction) of the market due to the introduction of new product (or the deletion of an existing
product). We have found this approach particularly useful when simulating the behavior of likely
customers of our key retail partners and the change in their “market” size.
Recently, we conducted several simulation analyses to test the impact of pricing strategies for
our retail partners. We offer two of them here. In both cases, the procedure was to generate a
baseline simulation, not only of shares of preference (i.e., number of units), but also of revenue
and (where available) retail margin. We then conducted experimental “what-if” simulations to
compare with the baseline scenario. Because both situations involved multiple products (and the potential for cross-cannibalization), we measured performance for the entire product line at the
retailer.
The first example involved a lineup of folding tables at a relatively large retail account (See
Figure 4). In question was the price of a key table model in that lineup, identified as Table Q in
the graphic. The lines in the graphic represent changes in overall table lineup units, revenue, and
retail margin, indexed against the current price-point scenario for Table Q (Index = 1.00). We ran
a number of experimental simulations based on adjustments to the Table Q price point and ran
the share of preference changes through pricing and margin calculations for the entire lineup.
Figure 4
Using Simulations to Measure Unit, Revenue & Margin Changes
As might be expected, decreasing the price of Table Q (holding all other prices and options
constant) would result in moderate increases in overall numbers of units sold (solid line), smaller
increases in revenue (due to the lower weighted-average price with the Table Q price cut; dashed
line), and decreases in retail margin (dotted line). Note that these margin decreases would rapidly
become severe, since the absolute value of a price decrease is applied to a much smaller margin
base. (See curve configurations to the left of the crossover point in Figure 4.)
On the other hand, if the price of Table Q were to be increased, the effects would go in the
opposite direction in each case: margin would increase, and revenue and units would decrease.
(See curve configurations to the right of the crossover point in Figure 4.)
This Figure 4 graphic, along with the precise estimates of units, revenue, and margin changes
with various Table Q price adjustments, provided the account with various options to be
considered, in light of their retail objectives to balance unit, revenue, and margin objectives.
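The indexing arithmetic behind a chart like Figure 4 is straightforward; the sketch below uses invented shares, prices, and costs rather than the study's simulator output.

```r
# Invented lineup data: shares of preference from the simulator, retail prices, and
# retail unit costs. Table Q's price varies; Tables R and S are held constant.
prices <- c(Q = 99, R = 79, S = 129)
costs  <- c(Q = 70, R = 55, S = 95)

share_base <- c(Q = 0.30, R = 0.25, S = 0.20)   # remainder chooses "None"
share_cut  <- c(Q = 0.40, R = 0.23, S = 0.18)   # assumed shares after a Q price cut
price_cut  <- replace(prices, "Q", 89)

index <- function(share, price) {
  c(units   = sum(share),
    revenue = sum(share * price),
    margin  = sum(share * (price - costs)))
}

round(index(share_cut, price_cut) / index(share_base, prices), 3)
# Units rise noticeably, revenue rises only slightly, margin falls
# (cf. the region left of the crossover point in Figure 4).
```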
The second example involves the prospective introduction of a new and innovative version of
a furniture product (non-Lifetime) already sold by a given retailer. Three variants of the new
product were tested: Product A at high and moderate price levels and Product B (an inferior
version of Product A) at a relatively low price. Each of these product-price scenarios was
compared with the base case for the current product continuing to be sold by itself. And, in
contrast to the previous table example, only units (share) and revenue were measured in this
experiment (retail margin data for the existing product were not available). (See Figure 5)
Figure 5
Using Simulations to Inform the Introduction of a New Product
[Figure shows stacked bars of shares of preference for the base case and the three new-product scenarios, with callouts noting the retailer's increase in total unit volume with introduction of the new product and the "sweet spot" product/pricing options.]
The first result to note (especially from the retailer’s perspective) is the overall expansion of
unit sales under all three new-product-introduction scenarios. This would make sense, since in
each case there would be two options for consumers to consider, thus reducing the proportion of
retailers’ consumers who would not buy either option at this store (light gray bar portions at top).
The second finding of note (especially from Lifetime’s point of view) was that unit sales of
the new concept (black bars at the bottom) would be maximized by introducing Product A at the
moderate price. Of course, this would also result in the smallest net unit sales of the existing
product (dark gray bars in the middle).
Finally, the matter of revenue is considered. As seen in the revenue index numbers at the
bottom margin of the graphic (with current case indexed at 1.00), overall retail revenue for this
category would be maximized by introducing Product A at the high price (an increase of 26%
over current retail revenue). However, it also should be noted that introducing Product A at the
moderate price would also result in a sizable increase in revenue over the current case (+21%,
only slightly lower than that of the high price).
Thus Lifetime and retailer were presented with some interesting options (see the “Sweet
Spot” callout in Figure 5), depending on how unit and revenue objectives for both current and
new products were enforced. And, of course, they also could consider introduction of Product B
at its low price point, which would result in the greatest penetration among the retailer’s
customers, but at the expense of almost zero growth in overall revenue.
#6. Maintain a Standard of Academic Rigor
Let’s face it: in a corporate-practitioner setting (especially as the sole conjoint analysis
“expert” in the company), it’s sometimes easy to become lackadaisical about doing conjoint
analysis right! It’s easy to consider taking short cuts. It’s easy to take the easy route with a
shorter questionnaire instead of the recommended number of ACBC Screening steps. And, it’s
easy to exclude holdout tasks in order to keep the survey length down and simplify the analysis.
Over the past year or so, we have concluded that, in order to prevent becoming too
complacent, a corporate practitioner of conjoint analysis may need to proactively maintain a
standard of academic rigor in his/her work. It is important to stay immersed in conjoint analysis
principles and methodology through seminars, conferences (even if the technical/mathematical
details are a bit of a comprehension “stretch”), and the literature. And, in the final analysis,
there’s nothing like doing a paper for one of those conferences to re-institute best practices!
Ideally a paper such as this should include some element of “research on research” (experiment
with methods, settings, etc.) to stretch one’s capabilities even further.
II. 2013 STORAGE SHED ACBC CASE STUDY
It had been nearly five years since Lifetime’s last U.S. storage shed conjoint study when the
Company’s Lawn & Garden management team requested an update to the earlier study. In
seeking a new study, their objective was to better inform current strategic planning, tactical
decision-making, and sales presentations for this important category.
At the same time, this shed study “refresh” presented itself as an ideal vehicle as a case study
for the current Sawtooth Software conference paper to illustrate the Company’s recent progress
in its conjoint analysis practices. The specific objectives for including this case study in this
current paper are shown below.
• Demonstrate a typical conjoint analysis done by a private practitioner in an industrial setting.
• Validate the new conjoint model by comparing market simulations with in-sample holdout preference tasks (test-retest format).
• Include a "research on research" split-sample test on the effects of three different ACBC research design settings.
An overview of the 2013 U.S. Storage Shed ACBC study is included in the table below:
ACBC Instrument Example Screenshots
As many users and students of conjoint analysis know, Sawtooth Software’s Adaptive
Choice-based conjoint protocol begins with a Build-Your-Own (BYO) exercise to establish the
respondent’s preference positioning within the array of all possible configurations of the product
in question. (Figure 12, to be introduced later, illustrates this positioning visually.) Figure 6
shows a screenshot of the BYO exercise for the current storage shed conjoint study.
Figure 6
Build-Your-Own (BYO) Shed Exercise
The design with 9 non-price attributes and 25 total levels results in a total of 7,776 possible
product configurations. In addition, the range of summed prices from $199 to $1,474, amplified
by a random price variation factor of ±30 percent (in the Screening phase of the protocol, to
follow) provides a virtually infinite array of product-price possibilities. Note the use of
conditional graphics (shed rendering at upper right) to help illustrate three key attributes that
drive most shed purchase decisions (square footage, roof height, and materials of construction).
Following creation of the panelist’s preferred (BYO) shed design, the survey protocol asks
him/her to consider a series of “near-neighbor” concepts and to designate whether or not each
one is a possibility for purchase consideration. (See Figure 7 and, later, Figure 12.) In essence,
the subject is asked to build a consideration set of possible product configurations from which
s/he will ultimately select a new favorite design. This screening exercise also captures any non-compensatory selection behaviors, as the respondent can designate some attribute levels as ones
s/he must have—or wants to exclude—regardless of price.
Figure 7
Shed Concept Screening Exercise
Note again the use of conditional graphics, which help guide the respondents by mimicking a
side-by-side visual comparison common to many retail shed displays.
Following the screening exercise and the creation of an individualized short list of possible
configurations, these concepts are arrayed in a multi-round “tournament” setting where the
panelist ultimately designates the “best” product-price option. Conditional graphics again help
facilitate these tournament choices. (See Figure 8)
Figure 8
Shed Concept “Tournament” Exercise
The essence of the conjoint exercise is not to derive the "best" configuration, however. Rather, it is to discover empirically how the panelist makes the simulated purchase decision, including the:
- relative importance of the attributes in that decision,
- levels of each attribute that are preferred,
- interaction of these preferences across all attributes and levels, and
- implicit price sensitivity for these features, individually and collectively.
As we like to tell our clients trying to understand the workings and uses of conjoint analysis,
“It’s the journey—not the destination—that’s most important with conjoint analysis.”
ACBC Example Diagnostic Graphics
Notwithstanding the ultimate best use of conjoint analysis as a tool for market simulations,
there are a few diagnostic reports and graphics that help clients understand what the program is
doing for their respective study. First among these is the average attribute-importance
distribution, in this case derived through Hierarchical Bayes estimation of the individual part-worths from the Multinomial Logit procedure. (See Figure 9)
Figure 9
Relative Importance of Attributes from HB Estimation
It should be noted that these are only average importance scores, and that the simulations
ultimately will take into account each individual respondent’s preferences, especially if those
preferences are far different from the average. Nevertheless, our clients (especially sales
managers who are reporting these findings to retail chain buyers) can relate to interpretations
such as “20 percent of a typical shed purchase decision involves—or is influenced by—the size
of the shed.”
Note in this graphic that price occupies well over one-third of the decision space for this
array of shed products. This is due in large part to the wide range of prices ($199 minus 30
percent, up to $1,474 plus 30 percent) necessary to cover the range of sheds from a 25-square-foot sheet metal model with few add-on features up to a 100-square-foot wooden model with multiple add-ons. Within a defined sub-range of shed possibilities that most consumers would consider (say, plastic sheds in the 50-to-75-square-foot range, with several feature add-ons), the
relative importance of price would diminish markedly and the importance of other attributes
would increase.
A companion set of diagnostics to the importance pie chart above involves line graphs
of the relative conjoint utility scores (usually zero-centered), showing the relative
preferences for levels within each attribute. Again, recognizing that these are only averages, they
provide a quick snapshot of the overall preference profile for attributes and levels. They also
provide a good diagnostic to see if there are any reversals (e.g., ordinal-scale levels that do not
follow a consistent progression of increasing or decreasing utility scores). (See Figure 10)
Figure 10
Average Conjoint Utility Scores from HB Estimation
[Figure shows zero-centered average conjoint utility scores (Survey Sampling Inc. nationwide sample, n=643) for selected attributes and levels: Construction (Sheet Metal, Steel-reinforced Resin, Treated Wood), Square Footage (25 SF c. 5'x5', 50 SF c. 7'x7', 75 SF c. 8'x8', 100 SF c. 10'x10'), Roof Height (6 Feet, 8 Feet), Wall Style (Plain, Siding-style, Brick-style), Flooring (Not included, Plastic, Plywood), and Shelving (Not included, 2 shelves).]
The final diagnostic graphic we offer is the Price Utility Curve. (See Figure 11) It is akin to
the Average Level Utility Scores, just shown, except that (a) in contrast to most feature-based
attributes, its curve has a negative slope, and (b) it can have multiple, independently sloped curve
segments (eight in this case), using ACBC’s Piecewise Pricing estimation option. Our clients can
also relate to this as a surrogate representation for a demand curve, with varying slopes (price
sensitivity).
Figure 11
Price Utility Curve Using Piecewise Method
[Figure shows price utility scores (higher = more preferred) estimated with the piecewise method under a negative price constraint (Survey Sampling Inc. nationwide sample, n=571 net), plotted against retail prices from $0 to $2,000. Callouts mark the relevant price range for plastic sheds (25-100 SF), the ranges for sheet metal and wooden sheds, and a possible perceptual breakpoint at $999.]
There are a few items of particular interest in this graphic. First, the differences in price
ranges among the three shed types are called out. Although the differentiation between sheet
metal and plastic sheds is fairly clear-cut, there is quite a bit of overlap between plastic and
wooden sheds. Second, the price cut points have been set at $200 increments to represent key
perceptual price barriers (especially $1,000, where there appears to be a possible perceptual
barrier in the minds of consumers).
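As an illustration of how a piecewise price utility can be evaluated between cut points, the sketch below interpolates invented utility values over $200 increments; the numbers are not those of Figure 11.

```r
# Invented, monotonically decreasing average price utilities at $200 cut points
cut_points <- seq(200, 1800, by = 200)
cut_utils  <- c(160, 120, 85, 55, 20, -40, -90, -130, -170)

# Piecewise-linear interpolation returns the utility of any summed price
price_utility <- approxfun(cut_points, cut_utils, rule = 2)

price_utility(c(450, 999, 1001))
# With cut points every $200, a perceptual barrier near $999 shows up only as a
# steeper segment between $800 and $1,000; an explicit cut point at $999 would be
# needed to model a sharp drop at that price.
```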
III. ACBC EXPERIMENTAL DESIGN AND TEST RESULTS
This section describes the experimental design of the 2013 Storage Shed ACBC study, and
the “research-on-research” question we attempt to answer. It also discusses the holdout task
specification and the measures used to determine the precision of the conjoint model in its
various test situations. Finally, the results of this experimental test are presented.
Split-Sample Format to Test ACBC Designs
The Shed study used a split-sample format with three Adaptive Choice design variants based
on incrementally relaxed definitions of the “near-neighbor” concept in the Screener section of the
ACBC protocol. We have characterized those three design variants as Version 1—Conservative
departure from the respondent’s BYO concept, Version 2—Moderate departure, and Version 3—
Aggressive departure. (See Figure 12)
Figure 12
Conservative to Aggressive ACBC Design Strategies
[Figure illustrates the ACBC design strategy of generating "near-neighbor" concepts rather than drawing from the full factorial. Within the total multivariate attribute space (9 attributes, nearly 7,800 unique product combinations, plus a virtually infinite number of prices), concepts radiate out from the respondent's BYO ("ideal") shed configuration: Version 1 / "Conservative" varies 2-3 attributes from the BYO concept per task (n=205), Version 2 / "Moderate" varies 3-4 attributes (n=210), and Version 3 / "Aggressive" varies 4-5 attributes (n=228). Adapted from Orme's 2008 ACBC beta test instructional materials.]
Each qualified panelist was assigned randomly to one of the three questionnaire versions. As
a matter of course, we verified that the demographic and product/purchase profiles of each of the
three survey samples were similar (i.e., gender, age, home ownership, shed ownership, shed
purchase likelihood, type of shed owned or likely to purchase, and preferred store for a shed
purchase).
Going into this experiment, we had several expectations regarding the outcome. First, we
recognized that Version 1—Conservative would be the least-efficient experimental design,
because it defined “near neighbor” very closely, and therefore the conjoint choice tasks would
include only product configurations very close to the BYO starting point (only 2 to 3 attributes
were varied from the BYO-selected concept to generate additional product concepts). At the
other end of the spectrum, Version 3—Aggressive would have the widest array of product
configurations (varying from 4 to 5 of the attributes from the respondent’s BYO-selected concept
to generate additional concepts), resulting in a more efficient design. This is borne out by D-efficiency calculations provided by Bryan Orme of Sawtooth Software using the results of this
study. As shown in Figure 13, the design of the Version 3 conjoint experiment was 27% more
efficient than that of the Version 1 experiment.
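The near-neighbor logic can be sketched in a few lines; the attribute and level counts below are invented, and actual concept generation (including pricing) is handled inside ACBC.

```r
set.seed(11)
n_attr   <- 9
n_levels <- rep(3, n_attr)                               # invented: 3 levels per attribute
byo      <- sapply(n_levels, function(k) sample(k, 1))   # respondent's BYO configuration

near_neighbor <- function(byo, n_change) {
  concept <- byo
  for (a in sample(length(byo), n_change)) {             # which attributes to vary
    concept[a] <- sample(setdiff(seq_len(n_levels[a]), byo[a]), 1)
  }
  concept
}

# Version 1 (conservative) varies 2-3 attributes; Version 3 (aggressive) varies 4-5
rbind(BYO          = byo,
      Conservative = near_neighbor(byo, sample(2:3, 1)),
      Aggressive   = near_neighbor(byo, sample(4:5, 1)))
```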
Figure 13
Calculated D-efficiency of Design Versions

                                    D-Efficiency   Index
Version 1 (2-3 changes from BYO)        0.44        100
Version 2 (3-4 changes from BYO)        0.52        118
Version 3 (4-5 changes from BYO)        0.56        127

Calculations courtesy of Bryan Orme, using Sawtooth CVA and ACBC Version 8.2
Despite the statistical efficiency advantage of Version 3, we fully expected Version 1 to
provide the most accurate results. In thinking about the flow of the interview, we felt Version 1
would be the most user-friendly for a respondent, since most of the product configurations shown
in the Screening section would be very close to his/her BYO specification. The respondent would
feel that the virtual interviewer (“Kylie” in this case) is paying attention to his/her preferences,
and therefore would remain more engaged in the interview process and (presumably) be more
consistent in answering the holdout choices. In contrast, the wider array of product
configurations from the more aggressive Version 3 approach might be so far afield that the
panelist would feel the interviewer is not paying as much attention to previous answers. As a
result, s/he might become frustrated and uninvolved in the process, thereby registering less-reliable utility scores.
One of the by-products of this test was the expectation that those participating in the Version
1 questionnaire would see relatively more configurations they liked and would therefore bring
forward more options (“possibilities”) into the Tournament section. Others answering the Version
3 questionnaire would see fewer options they liked and therefore would bring fewer options
forward into the Tournament. As shown in Figure 14, this did indeed happen, with a slightly (but
significantly) larger average number of conjoint concepts being brought forward in Version 1
than in Version 3.
Figure 14
Distribution of Concepts Judged to be "a Possibility"
[Figure shows cumulative distributions of the number of shed concepts judged to be "a possibility" (maximum 28) for each questionnaire version (Survey Sampling Inc. nationwide sample, n=643). Version 1 / Conservative respondents brought forward larger numbers of "possibilities" (mean=15.7) than Version 2 (mean=15.0) and Version 3 / Aggressive (mean=14.3); the overall mean was 15.0, and the differences in mean number of "possibilities" among the three ACBC versions are significant (P=.034).]
Validation Using In-sample Holdout Questions
To validate the results of this survey experiment we used four in-sample holdout questions
containing three product concepts each. The same set of holdouts was administered to all
respondents in each of the three versions of the survey instrument. We also used a test-retest
verification procedure, where the same set of four questions was repeated immediately, but with
the order of presentation of concepts in each question scrambled.
A summary of this Holdout Test-Retest procedure is included in the table below:
In order to maximize the realism of the holdouts, we generated the product configuration
scenarios using real-life product options and prices (including extensive level overlap, in some
cases). Of the 12 concepts shown, five were plastic sheds (three based on current Lifetime
models and the other two based on typical competitor offerings), four were wooden sheds
(competitors), and three were sheet metal sheds (competitors). In an effort to test realistic level
overlap, the first holdout task (repeated in scrambled order in the fifth task) contained a Lifetime
plastic shed and a smaller wooden shed, both with the same retail price. Likewise, in the second
(and sixth) task Lifetime and one of its key plastic competitors were placed head-to-head. In
keeping with marketplace realities, both of these models were very similar, with the Lifetime
offering having a $100 price premium to account for three relatively minor product feature
upgrades. (As will be seen shortly, this scenario made for a difficult decision for many
respondents, and they were not always consistent during the test-retest reality check.)
To illustrate these holdout task-design considerations, two examples are shown below: Figure
15 (which compares Holdout Tasks #1 and #5) and Figure 16 (which compares Holdout Tasks #2
and #6).
Figure 15
In-Sample Holdout Tasks #1 & #5 (featuring the Lifetime 8x10 Shed)

Task #1 (First Phase)
            Concept 1   Concept 2   Concept 3
Version 1     19.5%       22.9%       57.6%
Version 2     16.2%       20.0%       63.8%
Version 3     11.4%       23.2%       65.4%
TOTAL         15.6%       22.1%       62.4%

Task #5 (Second Phase/Scrambled)
            Concept 1   Concept 2   Concept 3
Version 1     21.5%       61.0%       17.6%
Version 2     20.0%       64.8%       15.2%
Version 3     20.6%       64.0%       15.4%
TOTAL         20.7%       63.3%       16.0%
This set of holdout concepts nominally had the most accurate test-retest results of the four—
and provided the best predictive ability for the conjoint model as well. Note that the shares of
preference for the Lifetime 8x10 Shed are within one percentage point of each other. Also, note
that the Lifetime shed was heavily preferred over the comparably priced (but smaller and more
sparsely featured) wooden shed.
Figure 16
In-Sample Holdout Tasks #2 & #6 (Lifetime 7x7 Shed vs. a close competitor)

Task #2 (First Phase)
            Concept 1   Concept 2   Concept 3
Version 1     19.5%       38.0%       42.4%
Version 2     25.7%       35.7%       38.6%
Version 3     27.6%       32.9%       39.5%
TOTAL         24.4%       35.5%       40.1%

Task #6 (Second Phase/Scrambled)
            Concept 1   Concept 2   Concept 3
Version 1     36.1%       30.2%       33.7%
Version 2     31.9%       31.9%       36.2%
Version 3     32.5%       31.1%       36.4%
TOTAL         33.4%       31.1%       35.5%
This set of holdout concepts nominally had the least accurate test-retest results of the four and provided the worst predictive ability for the conjoint model. (Holdouts 3 & 7 and 4 & 8 had moderately reliable replication rates.) Note that the shares of
preference for the Lifetime 7x7 Shed and those of the competitor 7x7 shed varied more between
the Test and Retest phases than the task in Figure 15. It is also interesting to note that the
competitor shed appeared to pick up substantial share of preference in the Retest phase when
both products are placed side-by-side in the choice task. This suggests that, when the differences
are well understood, consumers may not evaluate the $100 price premium for the more fully
featured Lifetime shed very favorably.
Test Settings and Conditions
Here are the key settings and conditions of the Shed conjoint experiment and validation:
• Randomized First Choice simulation method:
  o Scale factor (Exponent within the simulator) adjusted within version to minimize errors of prediction (Version 1 = 0.55, Version 2 = 0.25, Version 3 = 0.35)
  o Root Mean Square Error (rather than Mean Absolute Error) used in order to penalize extreme errors
• Piecewise price function with eight segments, with the price function constrained to be negative
• Hit Rates used all eight holdout concepts (both Test and Retest phases together)
• Deleted 72 bad cases prior to final validation:
  o "Speeders" (less than four minutes) and "Sleepers" (more than one hour)
  o Poor Holdout Test-Retest consistency (fewer than two out of four tasks consistent)
  o Discrimination/straight-line concerns (panelist designated no concepts or all 28 concepts as "possibilities" in the Screening section)
Note that the cumulative impact of all these adjustments on Hit Rates was about +4 to +5
percentage points (i.e., moved from the low 60% range to the mid 60% range, as shown below).
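For reference, the validation metrics reported below can be computed as in the following sketch, which uses fabricated holdout data purely to show the calculations.

```r
set.seed(5)
n_resp <- 571

# Fabricated first choices for one holdout task: test phase, retest phase, and the
# choice predicted for each respondent by the market simulator
observed_test   <- sample(1:3, n_resp, replace = TRUE, prob = c(.16, .22, .62))
observed_retest <- ifelse(runif(n_resp) < 0.83, observed_test,
                          sample(1:3, n_resp, replace = TRUE))
predicted       <- ifelse(runif(n_resp) < 0.66, observed_test,
                          sample(1:3, n_resp, replace = TRUE))

hit_rate <- mean(predicted == observed_test)
trr      <- mean(observed_test == observed_retest)
rhr      <- hit_rate / trr                  # relative hit rate (cf. the 80% figure)

# RMSE of predicted vs. observed shares of preference, in percentage points
pred_share <- 100 * prop.table(table(factor(predicted, levels = 1:3)))
obs_share  <- 100 * prop.table(table(factor(observed_test, levels = 1:3)))
rmse <- sqrt(mean((pred_share - obs_share)^2))

round(c(hit_rate = hit_rate, test_retest = trr, relative_hit_rate = rhr, rmse = rmse), 2)
```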
The validation results for each survey version—and each holdout set—are provided in Figure
17. In each case, the Root Mean Square Error and Hit Rate are reported, along with the
associated Test-Retest Rate (TRR).
Figure 17
Summary of Validation Results

                                   Root MSE   Hit Rate   Test-Retest Rate
Across all 4 sets of holdouts:
  Version 1 / Conservative            4.4       64%           82%
  Version 2 / Moderate                6.5       65%           81%
  Version 3 / Aggressive              6.5       69%           85%
Across all 3 questionnaire versions:
  Holdouts 1 & 5                      5.2       75%           87%
  Holdouts 2 & 6                      4.9       54%           73%
  Holdouts 3 & 7                      3.8       68%           85%
  Holdouts 4 & 8                      8.6       66%           85%
OVERALL                               5.9       66%           83%
Here are some observations regarding the results on this table:
• Overall, the Relative Hit Rate (RHR) was about as expected (66% HR / 83% TRR = 80% RHR).
• The nominally increasing hit rate from Version 1 to Version 3 was not expected (we had expected it to decrease).
• There was a general lack of consistency between Root MSEs and Hit Rates, suggesting a lack of discernible impact of ACBC design (as tested) on precision of estimation.
• Holdouts #2 & #6 were the most difficult for respondents to deal with, with a substantially lower Hit Rate and Test-Retest Rate (but, interestingly, not the highest RMSE!).
In order to determine statistically whether adjustments in the ACBC design had a significant
impact on ability to predict respondents’ choices, we generated the following regression model,
wherein we tested for significance in the two categorical (dummy) variables representing
incremental departures from the near-neighbor ACBC base case. We also controlled for overall
error effects, as measured by the holdout Test-Retest Rate. (See Figure 18)
Figure 18
Shed ACBC Hit Rate Model

Hit Rate = f (ACBC Version, Test-Retest Rate)

Empirical Regression Model:
HR {0–1} = .34 + .015 (V2 {0,1}) + .040 (V3 {0,1}) + .36 TRR {0–1}

Note Constants:
V1 (Conservative) = .34; V2 (Moderate) = .355; V3 (Aggressive) = .38

Model Significance:
Overall (F): P=.000, Adjusted R2=0.070
V2 (Dummy) Coefficient (T): P=.568
V3 (Dummy) Coefficient (T): P=.114
Test-Retest Coefficient (T): P=.000
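As a minimal sketch of how a regression of this form could be fit, assuming a respondent-level data set with a hit-rate score, a questionnaire-version indicator, and a test-retest rate (the variable names and toy values below are hypothetical, not the study data):

```python
# Fit HR = b0 + b1*V2 + b2*V3 + b3*TRR by ordinary least squares,
# with Version 1 (the conservative base case) as the omitted category.
import numpy as np

# Hypothetical respondent-level rows: (version, test-retest rate, hit rate)
rows = [(1, 0.75, 0.50), (2, 0.75, 0.62), (3, 1.00, 0.75),
        (1, 1.00, 0.62), (2, 0.50, 0.38), (3, 0.75, 0.62),
        (1, 0.50, 0.38), (2, 1.00, 0.75), (3, 0.50, 0.50)]

version = np.array([r[0] for r in rows])
trr = np.array([r[1] for r in rows])
hr = np.array([r[2] for r in rows])

# Design matrix: intercept, dummy for Version 2, dummy for Version 3, TRR.
X = np.column_stack([np.ones(len(rows)),
                     (version == 2).astype(float),
                     (version == 3).astype(float),
                     trr])

coefs, *_ = np.linalg.lstsq(X, hr, rcond=None)
print(dict(zip(["intercept", "V2", "V3", "TRR"], coefs.round(3))))
```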
Here are our observations on this regression model:

• Hit Rates increased only 1.5% and 4.0% from Version 1 (base case) to Versions 2 and 3, respectively. Neither of these coefficients was significant at the .05 level.
• The error-controlling variable (Test-Retest Rate) was significant—with a positive coefficient—suggesting that hit rates go up as respondents pay closer attention to the quality and consistency of their responses.
• Of course, the overall model is significant, but only because of the Test-Retest controlling variable.
• Questionnaire version (i.e., aggressiveness of ACBC design) does NOT have a significant impact on Hit Rates.
KEY TAKEAWAYS
Validation procedures verify that the overall predictive value of the 2013 Storage Shed
ACBC Study is reasonable. The overall Relative Hit Rate was .80 (or .66 / .83), implying that
model predictions are 80% as good as the Test-Retest rate. Root MSEs were about 6.0—with
corresponding MAEs in the 4–5 range—which is generally similar to other studies of this nature.
The evidence does not support the notion that Version 1 (the current, conservative
ACBC design) provides the most valid results. Even controlling for test-retest error rates, there
was no statistical difference in hit rates among the three ACBC design approaches. (In fact, if
there were an indication of one design being more accurate than the others, one might argue that it could be in favor of Version 3, the most aggressive approach, with its nominally positive
coefficient.) While this apparent lack of differentiation in results among the three ACBC designs
could be disappointing from a theoretical point of view, there are some positive implications:
For Lifetime Products: Since differences in predictive ability among the three test
versions were not significant, we can combine data sets (n=571) for better statistical
precision for client simulation applications.
For Sawtooth Software—and the Research Community in general: The conclusion
that there are no differences in predictive ability, despite using a variety of conjoint
design settings, could be a good “story to tell” about the robustness of ACBC procedures
in different design settings. This is especially encouraging, given the prospect of even
more-improved design efficiencies with the upcoming Version 8.3 of ACBC.
Lifetime Products’ experiences and learnings over the past few years suggest several key takeaways, particularly for new practitioners of conjoint analysis and other quantitative marketing tools.

• Continue to explore and experiment with conjoint capabilities and design options.
• Look for new applications for conjoint-driven market simulations.
• Continuously improve your conjoint capabilities.
• Don’t let it get too routine! Treat your conjoint work with academic rigor.
Robert J. Goodwin
REFERENCES
Goodwin, Robert J: Introduction of Quantitative Marketing Research Solutions in a Traditional
Manufacturing Company: Practical Experiences. Proceedings of the Sawtooth Software
Conference, March 2009, pp. 185–198.
Goodwin, Robert J: The Impact of Respondents’ Physical Interaction with the Product on
Adaptive Choice Results. Proceedings of the Sawtooth Software Conference, October 2010,
pp. 127–150.
Johnson, Richard M., Orme, Bryan K., Huber, Joel & Pinnell, Jon: Testing Adaptive Choice-Based Conjoint Designs, 2005. Sawtooth Software Research Paper Series.
Orme, Bryan K., Alpert, Mark I. & Christensen, Ethan: Assessing the Validity of Conjoint
Analysis—Continued, 1997. Sawtooth Software Research Paper Series.
Orme, Bryan K.: Fine-Tuning CBC and Adaptive CBC Questionnaires, 2009. Sawtooth Software
Research Paper Series.
Special acknowledgement and thanks to:
Bryan Orme (Sawtooth Software Inc.)
Paul Johnson, Tim Smith & Gordon Bishop (Survey Sampling International)
Chris Chapman (Google Inc.)
Clint Morris & Vince Rhoton (Lifetime Products, Inc.)
CAN CONJOINT BE FUN?:
IMPROVING RESPONDENT ENGAGEMENT IN CBC EXPERIMENTS
JANE TANG
ANDREW GRENVILLE
VISION CRITICAL
SUMMARY
Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the
number of respondents for Choice Based Conjoint (CBC) studies in the era of on-line panels.
The results showed that respondents become less engaged in later tasks. Increasing the number
of choice tasks brought limited improvement in the model’s ability to predict respondents’
behavior, and actually decreased model sensitivity and consistency.
In 2012, we looked at how shortening CBC exercises impacts the individual level precision
of HB models, with a focus on the development of market segmentation. We found that using a
slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most
conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes.
However, a CBC exercise with only 8 to 10 tasks is still considered boring by many
respondents. In this paper, we looked at two ideas that may be useful in improving respondents’
enjoyment level:
1. Augmenting the conjoint exercise using adaptive/tournament based choices.
2. Sharing the results of the conjoint exercise.
Both of these interventions turn out to be effective, but in different ways. The
adaptive/tournament tasks make the conjoint exercise less repetitive, and at the same time
provide a better model fit and more sensitivity. Sharing results has no impact on the performance
of the model, but respondents did find the study more “fun” and more enjoyable to complete.
We encourage our fellow practitioners to review conjoint exercises from the respondent’s
point of view. There are many simple things we can do to make the exercise appealing, and
perhaps even add some “fun.” While these new approaches may not yield better models, simply
giving the respondent a more enjoyable experience, and by extension making him a happier
panelist (and one who is less likely to quit the panel), would be a goal worth aiming for.
1. INTRODUCTION
In the early days of CBC, respondents were often recruited into “labs” to complete
questionnaires, either on paper or via a CAPI device. They were expected to take up to an hour to
complete the experiment and were rewarded accordingly. The CBC tasks, while more difficult to
complete than other questions (e.g., endless rating scale questions), were considered interesting
by the respondents. Within the captive environment of the lab, respondents paid attention to the
attributes listed and considered tradeoffs among the alternatives. Fatigue still crept in, but not
until after 20 or 30 such tasks.
Johnson & Orme (1996) was the earliest paper the authors are aware of to address the
suitable length of a CBC experiment. The authors determined that respondents could answer at
least 20 choice tasks without degradation in data quality. Hoogerbrugge & van der Wagt (2006)
was another paper to address this issue. It focused on holdout task choice prediction. They found
that 10–15 tasks are generally sufficient for the majority of studies. The increase in hit rates
beyond that number was minimal.
Today, most CBC studies are conducted online using panelists as respondents. CBC exercises
are considered a chore. In the verbatim feedback from our panelists, we see repeated complaints
about the length and repetitiveness of choice tasks.
Tang and Grenville (2010) examined the tradeoff between the number of choice tasks and the
number of respondents in the era of on-line panels. The results showed that respondents became
less engaged in later tasks. Therefore, increasing the number of choice tasks brought limited
improvement in the model’s ability to predict respondents’ behavior, and actually decreased
model sensitivity and consistency.
In 2012, we looked at how shortening CBC exercises affected the individual-level precision
of HB models, with a focus on the development of market segmentation. We found that using a
slightly smaller number of tasks was not harmful to the segmentation process. In fact, under most
conditions, a choice experiment using only 10 tasks was sufficient for segmentation purposes.
However, a CBC exercise with only 8 to 10 tasks is still considered boring by many
respondents. The GreenBook blog noted this, citing CBC tasks as number four in a list of the top
ten things respondents hate about market research studies.
http://www.greenbookblog.org/2013/01/28/10-things-i-hate-about-you-by-mr-r-e-spondent/
2. WHY “FUN” MATTERS?
An enjoyable respondent survey experience matters in two ways:
Firstly, when respondents are engaged they give better answers that show more sensitivity,
less noise and more consistency. In Suresh & Conklin (2010), the authors observed that faced
with the same CBC exercise, those respondents who received the more complex brand attribute
section chose “none” more often and had more price order violations. In Tang & Grenville
(2010), we observed later choice tasks result in more “none” selections. When exposed to a long
choice task, the respondents’ choices contained more noise, resulting in less model sensitivity
and less consistency (more order violations).
Secondly, today a respondent is often a panelist. A happier respondent is more likely to
respond to future invites from that panel. From a panelist retention point of view, it is important
to ensure a good survey experience. We at Vision Critical are in a unique position to observe this
dynamic. Vision Critical’s Sparq software enables brands to build insight communities (a.k.a.
brand panels). Our clients not only use our software, but often sign up for our service in
recruiting and maintaining the panels. From a meta analysis of 393 panel satisfaction surveys we
conducted for our clients, we found that “Survey Quality” is the Number Two driver of panelist
satisfaction, just behind “your input is valued” and ahead of incentives offered.
Relative Importance of panel service attributes:
• The input you provide is valued: 16%
• The quality of the studies you receive: 15%
• The study topics: 15%
• The incentives offered by the panel: 13%
• The newsletters / communications that you receive: 12%
• The look and feel of studies: 9%
• The length of each study: 8%
• The frequency of the studies: 8%
• The amount of time given to respond to studies: 6%
There are many aspects to survey quality, not the least of which is producing a coherent and
logical survey instrument/questionnaire and having it properly programmed on a webpage. A
“fun” and enjoyable survey experience also helps to convey the impression of quality.
3. OUR IDEAS
There are many ways a researcher can create a “fun” and enjoyable survey. Engaging
question types that make use of rich media tools can improve the look and feel of a webpage on
which the question is presented, and make it easier for the respondent to answer those questions.
Examples of that can be found in Reid et al. (2007).
Aside from improving the look, feel and functionality of the webpages, we can also change
how we structure the questions we ask to make the experience more enjoyable. Puleson &
Sleep’s (2011) award-winning ESOMAR congress paper gives us two ideas.
The first is introducing a game-playing element into our questioning. In the context of
conjoint experiments, we consider how adaptive choice tasks could be used to achieve this. We
can structure conjoint tasks to resemble a typical game, so the tasks become harder as one
progresses through the levels. Orme (2006) showed how this could be accomplished in an
adaptive MaxDiff experiment. A MaxDiff experiment is where a respondent is shown a small set
of options, each described by a short description, and asked to choose the option he prefers most
as well as the option he prefers least. In a traditional MaxDiff, this task is followed by many
more sets of options with all the sets having the same number of options.
In an Adaptive MaxDiff experiment, this series of questioning is done in stages. While the
respondents see the traditional MaxDiff tasks in the first stage, those options chosen as preferred
“least” in stage 1 are dropped off in stage 2. The options chosen as preferred “least” in stage 2
are dropped off in stage 3, etc. The numbers of options used in the comparison in each stage get
progressively smaller, so there are changes in the pace of the questions. Respondents can also see
how their choices result in progressively more difficult comparisons. At the end, only the
favorites are left to be pitted against each other. Orme (2006) showed that respondents thought
this experience was more enjoyable.
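A minimal sketch of this staged elimination logic, assuming a toy respondent and hypothetical item set (a real Adaptive MaxDiff design would also control how items are allocated to sets within each stage):

```python
import random

def adaptive_maxdiff_stage(items, set_size, ask):
    """Show the items in sets of `set_size`; return the surviving items after
    dropping everything chosen as 'least preferred' in this stage."""
    random.shuffle(items)
    least_preferred = set()
    for i in range(0, len(items), set_size):
        subset = items[i:i + set_size]
        if len(subset) < 2:
            continue                       # nothing to trade off in a set of one
        best, worst = ask(subset)          # ask() returns (most preferred, least preferred)
        least_preferred.add(worst)
    return [item for item in items if item not in least_preferred]

# Toy respondent who prefers items with higher 'true' scores.
true_score = {f"item {i}": i for i in range(1, 13)}
def ask(subset):
    ranked = sorted(subset, key=lambda x: true_score[x])
    return ranked[-1], ranked[0]

items = list(true_score)
for stage, set_size in enumerate([4, 3, 2], start=1):   # sets shrink across stages
    items = adaptive_maxdiff_stage(items, set_size, ask)
    print(f"after stage {stage}: {len(items)} items remain")
```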
This type of adaptive approach is also at work in Sawtooth Software’s Adaptive CBC
(ACBC) product. The third step in an ACBC experiment is a choice tournament based on all the
product configurations in a respondent’s consideration set.
Tournament Augmented Conjoint (TAC) has been tried before by Chrzan & Yardley (2009). In
their paper, the authors added a series of tournament tasks to the existing CBC tasks. However,
as the CBC section was already quite lengthy, with accurate HB estimates, the authors concluded
that the additional TAC tasks provided only modest and non-significant improvements, which
did not justify the extra time it took to complete the questionnaire. However, we hypothesize that
if we have a very short CBC exercise and make the tournament tasks quite easy (i.e., pairs), the
tournament tasks may bring more benefits, or at least be more enjoyable for the panelists.
Our second idea comes from Puleson & Sleep (2011), who offered respondents a “two-way
conversation.” From Vision Critical’s panel satisfaction research, we know that people join
panels to provide their input. While respondents feel good about the feedback they provide, they
want to know that they have been heard. Sharing the results of studies they have completed is a
tangible demonstration that their input is valued. Most panel operators already do this, providing
feedback on the survey results via newsletters and other engagement tools. However, we can go
further. News media websites often have quick polls where they pose a simple question to
anyone visiting the website. As soon as a visitor provides her answer, she can see the results from
all the respondents thus far. That is an idea we want to borrow.
Dahan (2012) showed an example of personalized learning from a conjoint experiment. A
medical patient completed a series of conjoint tasks. Once he finished, he received the results
outlining his most important outcome criterion. This helped the patient to communicate his needs
and concerns to his doctors. It could also help him make future treatment decisions. Something
like this could be useful for us as well.
4. FIELD EXPERIMENT
We chose a topic that is of lasting interest to the general population: dating. We formulated a
conjoint experiment to determine what women were looking for in a man.
Cells
The experiment was fielded in May 2013 in Canada, US, UK and Australia. We had a sample
size of n=600 women in each country. In each country, respondents were randomly assigned into
one of four experimental cells.
Cell                                                                 n
CBC (8 Tasks, Triples)                                             609
CBC (8 Tasks, Triples) + Shareback                                 623
CBC (5 Tasks, Triples) + Tournament (4 Tasks, Pairs)               613
CBC (5 Tasks, Triples) + Tournament (4 Tasks, Pairs) + Shareback   618
The CBC-only cells received 8 choice tasks, all of them triples, while the Tournament cells had 9 tasks: 5 triples and 4 pairs. The amount of information, based on the number of
alternatives seen by each respondent, was approximately the same in all the cells.
We informed the respondents in the Shareback cells at the start of the interview that they
would receive the results from the conjoint experiment after it was completed.
Questionnaire
The questionnaire was structured as follows:
1. Intro/Interest in topic
2. All about you: Demos/Personality/Preferred activity on dates
3. What do you look for in a man?
o Personality
o BYO: your ideal “man”
4. Conjoint Exercise per cell assignment
5. Share Back per cell assignment
6. Evaluation of the study experience
A Build-Your-Own (BYO) task in which we asked the respondents to tell us about their ideal
“man” was used to educate the respondents on the factors and levels used in the experiment.
Tang & Grenville (2009) showed that a BYO task was effective in preparing respondents for
making choice decisions.
Vision Critical’s standard study experience module was used to collect the respondents’
evaluation data. This consisted of 4 attribute ratings measured on a 5-point agreement scale, and
any volunteered open-ended verbatim comments on the study topic and survey experience. The
four attribute ratings were:
• Overall, this survey was easy to complete
• I enjoyed filling out this survey
• I would fill out a survey like this again
• The time it took to complete the survey was reasonable
Factors & Levels
The following factors were included in our experiment. Note that body type images are used
in the BYO task only.
Attributes and levels:

• Age: Much older than me; A bit older than me; About the same age; A bit younger than me; Much younger than me
• Height: Much taller than me; A little taller than me; Same height as me; Shorter than me
• Body Type: Big & Cuddly; Big & Muscly; Athletic & Sporty; Lean & Fit (images used at the BYO question only, not in the conjoint tasks)
• Career: Driven to succeed and make money; Works hard, but with a good work/life balance; Has a job, but it's only to pay the bills; Prefers to find work when he needs it
• Activity: Exercise fanatic; Active, but doesn't overdo it; Prefers day to day life over exercise
• Attitude towards Family/Kids: Happy as a couple; Wants a few kids; Wants a large family
• Personality: Reliable & Practical; Funny & Playful; Sensitive & Empathetic; Serious & Determined; Passionate & Spontaneous
• Flower Scale: Flowers, even when you are not expecting; Flowers for the important occasions; Flowers only when he’s saying sorry; "What are flowers?"
• Yearly Income: Pretty low; Low middle; Middle; High middle; Really high
  o Australia: Under $50,000; $50,000 to $79,999; $80,000 to $119,999; $120,000 to $159,999; $160,000 or more
  o US/Canada: Under $30,000; $30,000 to $49,999; $50,000 to $99,999; $100,000 to $149,999; $150,000 or more
  o UK: Under £15,000; £15,000 – £39,999; £40,000 – £59,999; £60,000 – £99,999; £100,000 or more
Screen Shots
A CBC task was presented to the respondent as follows:
The adaptive/tournament tasks were formulated as follows. The 5 winners from the CBC tasks were randomly ordered and labeled item 1 to item 5, and then paired off, with the loser of each set dropped from further comparison:
• Set 1: Item 1 vs. Item 2
• Set 2: Item 3 vs. Item 4
• Set 3: Item 5 vs. the winner from Set 1
• Set 4: the winner from Set 2 vs. the winner from Set 3
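A rough sketch of how the four tournament pairs could be assembled from the five CBC winners; the bracket order follows our reading of the original exhibit, and the `ask_pair` respondent function and concept objects are hypothetical:

```python
import random

def tournament_tasks(cbc_winners, ask_pair):
    """Build the four pairwise tournament tasks from the five CBC winners.
    ask_pair(a, b) returns the concept the respondent prefers; losers drop out."""
    items = cbc_winners[:]
    random.shuffle(items)                  # label as item 1..5 in random order
    w1 = ask_pair(items[0], items[1])      # Set 1: item 1 vs. item 2
    w2 = ask_pair(items[2], items[3])      # Set 2: item 3 vs. item 4
    w3 = ask_pair(items[4], w1)            # Set 3: item 5 vs. winner of Set 1
    return ask_pair(w2, w3)                # Set 4: winner of Set 2 vs. winner of Set 3

# Toy example: each concept carries a single preference score; the simulated
# respondent always picks the higher-scoring concept in a pair.
concepts = [{"id": i, "score": random.random()} for i in range(5)]
favorite = tournament_tasks(concepts, lambda a, b: a if a["score"] >= b["score"] else b)
print("overall winner:", favorite["id"])
```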
The tournament task itself was shown on screen as a simple paired choice, and the personalized learning page was shown as a summary of the respondent’s results (screenshots omitted).
Personalized learning was based on frequency count only. For each factor, we counted how
often each level was presented to that respondent and how often it was chosen when presented.
The most frequently chosen level was presented back to the respondents.
These results were presented mostly for fun—the counting analysis was not the best for
providing this kind of individual feedback. The actual profiles presented to each individual
respondent in her CBC tasks were not perfectly balanced; the Tournament cells, where the
winners were presented to each respondent, would also have added bias for the counting
analysis. If we wanted to focus on getting accurate individual results, something like an
individual level logit model would be preferred. However, here we felt the simple counting
method would be sufficient and it was easy for our programmers to implement.
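A minimal sketch of this counting rule, assuming each task records the levels shown for every concept and which concept was chosen (the data structures are hypothetical, and we read "most frequently chosen" as the highest chosen-when-shown ratio):

```python
from collections import defaultdict

def most_chosen_levels(tasks):
    """For each factor, return the level with the highest chosen/shown ratio.
    Each task is (concepts, chosen_id), where concepts maps a concept id to a
    dict of factor -> level."""
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for concepts, chosen_id in tasks:
        for concept_id, levels in concepts.items():
            for factor, level in levels.items():
                shown[(factor, level)] += 1
                if concept_id == chosen_id:
                    chosen[(factor, level)] += 1
    best = {}
    for (factor, level), n_shown in shown.items():
        rate = chosen[(factor, level)] / n_shown
        if factor not in best or rate > best[factor][1]:
            best[factor] = (level, rate)
    return {factor: level for factor, (level, _) in best.items()}

tasks = [
    ({"A": {"Age": "About the same age"}, "B": {"Age": "A bit older than me"}}, "A"),
    ({"A": {"Age": "A bit older than me"}, "B": {"Age": "About the same age"}}, "B"),
]
print(most_chosen_levels(tasks))   # {'Age': 'About the same age'}
```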
Each respondent in the Shareback cells was also shown the aggregate results from her fellow
countrywomen who had completed the survey thus far in her experiment cell.
5. RESULTS
We built HB models for each of the 4 experimental cells separately. Part-worth utilities were
estimated for all the factors. Sawtooth Software’s CBC/HB product was used for the estimation.
Model Fit/Hit Rates
We deliberately did not design holdout tasks for this study. We wanted to measure the results
of making a study engaging, and using holdout tasks makes the study take longer to complete,
which tends to have the opposite effect. Instead of purposefully designed holdout tasks, we
randomly held out one of the CBC tasks to measure model fit. Since respondents tend to spend much longer on their first choice task, we excluded the 1st task for this purpose.
For each respondent, one of her 2nd, 3rd, 4th and 5th CBC tasks was randomly selected as the
holdout task.
The hit rates for the Tournament cells (63%) were much higher than the CBC cells (54%).
That result was surprising at first, since we would expect no significant improvement in model
performance for the Tournament cells. However, while the randomly selected holdout task was
held out from the model, the winner from that task was still included in the tournament tasks,
which may explain the increased performance.
In order to avoid any influence of the random holdout task, we reran the models for the
tournament cells again, holding out information from the holdout task itself, and any tournament
tasks related to its winner. The new hit rates (52%) are now comparable to that of the CBC cells.
However, by holding out not only the selected random holdout task, but also at least one and
potentially as many as three out of the four tournament tasks, we might have gone too far in
withholding information from the modeling. Had full information been used in the modeling, we
expect the tournament cells would have a better model fit and be better able to predict
respondent’s choice behavior.
Respondents seem to agree with this. Those who participated in the tournament thought we
did a better job of presenting them their personalized learning information. While this
information is based on a crude counting analysis and has potential bias issues, it is still
comforting to see this result.
The improvement in model fit is also reflected in a higher scale parameter in the model, with
the tournament cells showing stronger preferences, i.e., less noise and higher sensitivity. The
graph below shows the simulated preference shares for a “man” for each factor at the designated
level one at a time (holding all other factors at neutral). The shares are rescaled so that the
average across the levels within each factor sums to 0.
“Fun” & Enjoyment
Respondents had a lot of fun during this study. The topbox ratings for all 4 items track much
higher than the ratings for the congressional politics CBC study used in our 2010 experiment.
Disappointingly, there are not any differences across the 4 experimental cells among these
ratings. We suspect this is due to the high interest in the topic of dating and the fact we went out
of our way to make the experience a good one for all the cells. Had we tested these interventions in a
less interesting setting (e.g., smartphone), we think we would have seen larger effects.
Interestingly, we saw significant differences in the volunteered open-ended verbatim answers
from respondents. Many of these verbatim answers are about how they enjoyed the study
experience and had fun completing the survey. Respondents in the Shareback cells volunteered
more comments and more “fun”/enjoyment comments than the non-shareback cells.
While an increase of 6.7% to 9.0% appears to be only a small improvement, given that only
13% of the respondents volunteered any comments at all across the 4 cells, this reflects a
sizeable change.
6. CONCLUSIONS & RECOMMENDATION
Both of these interventions are effective, but in different ways. The adaptive/Tournament
tasks make the conjoint exercise less repetitive and less tedious, and at the same time provide
better model fit and more sensitivity in the results. While sharing results has no impact on the
performance of the model, the respondents find the study more fun and more enjoyable to
complete.
Should we worry about introducing bias with these methods? The answer is no. Adaptive
methods have been shown to give results consistent with the traditional approaches in many
different settings, both for Adaptive MaxDiff (Orme 2006) and numerous papers related to
ACBC. Aside from the scale difference, our results from the Tournament cells are also consistent
with that from the traditional CBC cells. Advising respondents that we would share back the
results of the findings also had no impact on their choice behaviors.
We encourage fellow practitioners to review conjoint exercises from the respondent’s point
of view. There are many simple things we can do to make the exercise appealing, and perhaps
even add “fun.” While these new approaches may not yield better models, simply giving the
respondent a more enjoyable experience, and by extension making him a happier panelist, would
be a goal worth aiming for.
In the words of a famous philosopher:
While conjoint experiments may not be enjoyable by nature, there is no reason respondents
cannot have a bit of fun in the process.
Jane Tang
REFERENCES
Chrzan, K. & Yardley, D. (2009), “Tournament-Augmented Choice-Based Conjoint” Sawtooth
Software Conference Proceedings.
Dahan, E. (2012), “Adaptive Best-Worst Conjoint (ABC) Analysis” Sawtooth Software
Conference Proceedings.
Hoogerbrugge, M. and van der Wagt, K. (2006), “How Many Choice Tasks Should We Ask?”
Sawtooth Software Conference Proceedings.
Johnson, R. and Orme, B. (1996), “How Many Questions Should You Ask In Choice-Based
Conjoint Studies?” ART Forum Proceedings.
Orme, B. (2006), “Adaptive Maximum Difference Scaling” Sawtooth Software Technical Paper
Library
Puleson, J. & Sleep, D. (2011), “The Game Experiments: Researching how gaming techniques
can be used to improve the quality of feedback from on-line research” ESOMAR Congress
2011 Proceedings
Reid, J., Morden, M. & Reid, A. (2007) “Maximizing Respondent Engagement: The Use of Rich
Media” ESOMAR Congress. Full paper can be downloaded from http://vcu.visioncritical.com/wp-content/uploads/2012/02/2007_ESOMAR_MaximizingRespondentEngagement_ORIGINAL-1.pdf
Suresh, N. and Conklin, M. (2010), “Quantifying the Impact of Survey Design Parameters on
Respondent Engagement and Data Quality” CASRO Panel Conference.
Tang, J. and Grenville, A. (2009), “Influencing Feature Price Tradeoff Decisions in CBC
Experiments,” Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2010), “How Many Questions Should You Ask in CBC Studies?—
Revisited Again” Sawtooth Software Conference Proceedings.
Tang, J. & Grenville, A. (2012), “How Low Can You Go?: Toward a better understanding of the
number of choice tasks required for reliable input to market segmentation” Sawtooth
Software Conference Proceedings.
MAKING CONJOINT MOBILE:
ADAPTING CONJOINT TO THE MOBILE PHENOMENON
CHRIS DIENER1
RAJAT NARANG2
MOHIT SHANT3
HEM CHANDER4
MUKUL GOYAL5
ABSOLUTDATA
INTRODUCTION: THE SMART AGE
With “smart” devices like smartphones and tablets integrating the “smartness” of personal
computers, mobiles and other viewing media, a monumental shift has been observed in the usage
of smart devices for information access. The sales of smart devices have been estimated to cross
the billion mark in 2013. The widespread usage of these devices has impacted the research world
too.
A study found that 64% of survey respondents preferred smartphone surveys, 79% of them
preferring to do so due to the “on-the-go” nature of it (Research Now, 2012). Multiple research
companies have already started administering surveys for mobile devices, predominantly
designing quick hit mobile surveys to understand the reactions and feedback of consumers, on-the-go.
Prior research (“Mobile research risk: What happens to data quality when respondents
use a mobile device for a survey designed for a PC,” Burke Inc, 2013) has suggested that
when comparing the results of surveys adapted for mobile devices to those on personal
computers, respondent experience is poorer and data quality is comparable for surveys on
mobile and personal computers.
This prior research also discourages the use of complex research techniques like conjoint on
the mobile platform. This comes as no surprise, as conjoint has long been viewed as a complex
and slightly monotonous exercise from the respondent’s perspective. Mobile platform’s small
viewer interface and internet speed can act as potential barriers for using conjoint.
ADAPTING CONJOINT TO THE MOBILE PLATFORM
Recognizing the need to reach respondents who are using mobile devices, research
companies have introduced three different ways of conducting surveys on mobile platforms—
web browser based, app based and SMS based. Of these three, web browser is the most widely
used, primarily due to the limited customization required to host the surveys simultaneously on
mobile platforms and personal computers. The primary focus of mobile-platform-based surveys
1 Senior Vice President, AbsolutData Intelligent Analytics [Email: [email protected]]
2 Senior Expert, AbsolutData Intelligent Analytics [Email: [email protected]]
3 Team Lead, AbsolutData Intelligent Analytics [Email: [email protected]]
4 Senior Analyst, AbsolutData Intelligent Analytics [Email: [email protected]]
5 Senior Programmer, AbsolutData Intelligent Analytics [Email: [email protected]]
is short and simple surveys like customer satisfaction, initial product reaction, and attitude and
usage studies.
However, the research industry is currently hesitant to conduct conjoint studies on the mobile platform due to concerns with:
• Complexity—conjoint is known to be a complex and intimidating exercise due to the number of tasks and the level of detail shown in the concepts
• Inadequate representation on the small screen—having a large number of long concepts on the screen can affect readability
• The short attention span of mobile users
• The feasibility of a conjoint study with a large number of attributes and tasks—if a large number of attributes are used, the entire concept may not fit on a single screen, requiring the user to scroll
• The penetration of smartphones in a region
In this paper, we hypothesize that all these can be countered (with the exception of smart
phone penetration) by focusing on improving the aesthetics and simplifying the conjoint tasks, as
illustrated in Figure 1. Our changes include:
• Improving aesthetics
  o Coding the task outlay to optimally use the entire screen space
  o Minimum scrolling to view the tasks
  o Reduction of number of concepts being shown on a screen
• Simplifying conjoint tasks
  o Reduction in number of tasks
  o Reduction of number of attributes on a screen
  o Using simplified conjoint methodologies
Figure 1
We improve aesthetics using programming techniques. To simplify conjoint tasks and make
them more readable to the respondents, we customize several currently used techniques to adapt
them to the mobile platform and compare their performance.
CUSTOMIZING CURRENT TECHNIQUES
Shortened ACBC
Similar to ACBC, this method uses pre-screening to identify most important attributes for
each respondent. The attributes selected in the pre-screening stage qualify through to the Build
Your Own (BYO) section and the Near Neighbor section. We omitted the choice tournament in
order to reduce the number of tasks being evaluated.
ACBC is known to present simpler tasks and better respondent engagement by
focusing on high priority attributes. We further simplified the ACBC tasks by truncating the list
of attributes and hence reducing the length of the concepts on the screen. Also, the number of
concepts per screen was reduced to 2 to simplify the tasks for respondents. An example is shown
in Figure 2.
Figure 2
Shortened ACBC Screenshot
Pairwise Comparison Rating (PCR)
We customize an approach similar to the Conjoint Value Analysis (CVA) method. Similar to
CVA, this method shows two concepts on the screen. We show respondents a 9-point scale and ask them to indicate preference—4 points on the left/right indicating preference for the product on the left/right respectively. The rating point in the middle implies that neither product is preferred.
For the purpose of estimation, we convert the rating responses to:
• Discrete Choice—If respondents mark either half of the scale, the data file (formatted as CHO for utility estimation with Sawtooth Software) reports the concept as being selected. Whereas, if they mark the middle rating point, it implies that they have chosen the None option.
• Chip Allocation—We convert the rating given by the respondents to volumetric shares in the CHO file. In case of a partial liking of the concept (wherein the respondent marked 2/3/4 or 6/7/8), we allocate the rest of the share to the “none” option. So, for example, a rating of 3 would indicate “Somewhat prefer left,” so 50 points would go to left concept and 50 to none. Similarly, a rating of 4 would indicate 25:75 in favor of none.
We include chip allocation to understand the impact of a reduced complexity approach on the
results. Also, in estimating results using chip allocation methods, the extent of likeability of the
product can also be taken into account (as opposed to single select in traditional CBC). We use
two concepts per screen to simplify the tasks for respondents, as shown in Figure 3.
Figure 3
Pairwise Comparison Rating Screenshot
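A minimal sketch of the rating-to-response conversion described above (the CHO file formatting itself is not shown; the chip values for ratings 1 and 2 are extrapolated from the worked examples given for ratings 3 and 4):

```python
def chip_allocation(rating):
    """Convert a 9-point pairwise rating (1 = strongly prefer left, 5 = neither,
    9 = strongly prefer right) into chip shares for (left, right, none)."""
    if rating == 5:
        return (0, 0, 100)                  # middle point: all share to "none"
    chips = {1: 100, 2: 75, 3: 50, 4: 25}   # strength of preference -> chips
    if rating < 5:
        return (chips[rating], 0, 100 - chips[rating])
    return (0, chips[10 - rating], 100 - chips[10 - rating])

def discrete_choice(rating):
    """Single-select reading of the same scale: left, right, or the None option."""
    if rating == 5:
        return "none"
    return "left" if rating < 5 else "right"

print(chip_allocation(3))    # (50, 0, 50)  "somewhat prefer left"
print(chip_allocation(4))    # (25, 0, 75)  mostly "none"
print(discrete_choice(7))    # right
```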
CBC (3 concepts per screen)
The 3-concept per screen CBC we employ is identical to CBC conducted on personal
computers (Figure 4). We do this to compare the data quality across platforms for an identical
method.
Figure 4
CBC Mobile (3 concepts) Screenshot
CBC (2 concepts per screen)
Similar to traditional CBC, we also include a CBC with only 2 concepts shown per screen.
This allows us to understand the result of direct CBC simplification, an example of which is
shown in Figure 5.
Figure 5
CBC mobile (2 concepts) Screenshot
Partial Profile
This method is similar to traditional Partial Profile CBC where we fix the primary attributes
on the screen and then rotate a set of secondary attributes.
We further simplify the tasks by reducing the length of the concept by showing a truncated
list of attributes and also by reducing the number of concepts shown per screen (2 concepts per
screen) as is shown in Figure 6.
Figure 6
Partial Profile Screenshot
RESEARCH DETAILS
The body of data collected to test our hypotheses and address our objectives is taken from
quantitative surveys we conducted in the US and India. In each country, we surveyed 1200
respondents (thanks to uSamp and IndiaSpeaks for providing sample in US and India
respectively). Each of our tested techniques was evaluated by 200 distinct respondents. Results
were also gathered for traditional CBC (3 concepts per screen) administered on personal
computer to compare as a baseline of data quality.
The topic of the surveys was the evaluation of various brands of tablet devices (a total of 9
attributes were evaluated). In addition to including the conjoint tasks, the surveys also explored
respondent reaction to the survey, as well as some basic demographics. The end of survey
questions included:
• The respondent’s experience in taking the survey on mobile platforms
• Validity of results from the researcher’s perspective
• Evaluation of the merits and demerits of each technique
• Evaluation of whether the efficacy of the techniques differs according to the online survey maturity of the region (fewer online surveys are conducted in India than in the US)
RESULTS
After removing the speeders and straightliners, we evaluate the effectiveness of the different
techniques from two perspectives—the researcher perspective (technical) and respondent
perspective (experiential). We compare the results for both countries side-by-side to highlight
key differences.
RESEARCHER’S PERSPECTIVE
Correlation analysis
Pearson correlations compare the utilities of each of the mobile methods with utilities of
CBC on personal computer. The results of the correlations are found in Table 1. Most of the
methods with the exception of PCR (estimated using chip allocation) show very high correlation
and thus appear to mimic the results displayed by personal computers. This supports the notion
that we are getting similar and valid results between personal computer and mobile platform
surveys.
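A minimal sketch of this comparison, assuming the part-worth utilities for each method are aligned as equal-length vectors over the same attribute levels (the values below are toy numbers, not study results):

```python
import numpy as np

# Aggregate part-worth utilities for the same attribute levels, estimated from
# the benchmark CBC PC cell and from one of the mobile methods (toy values).
cbc_pc_utilities = np.array([0.45, -0.10, -0.35, 0.20, 0.05, -0.25])
mobile_utilities = np.array([0.40, -0.05, -0.35, 0.25, 0.00, -0.25])

r = np.corrcoef(cbc_pc_utilities, mobile_utilities)[0, 1]
print(f"Pearson correlation with CBC PC: {r:.3f}")
```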
Table 1. Correlation analysis with utilities of CBC PC
Holdout accuracy
We placed fixed choice tasks in the middle and at the end of the exercise for each method.
Due to the varying nature of the methods (varying number of concepts and attributes shown per
task), the fixed tasks were not uniform across methods, i.e., each fixed task was altered as per the
technique in question. For example, partial profile only had 6 attributes for 2 concepts versus a
full profile CBC which had 9 attributes for 3 concepts and hence the fixed tasks were designed
accordingly.
As displayed in Table 2, Holdout task prediction rates are strong and in the typical expected
range. All the methods customized for mobile platforms do better than CBC on personal
computers (with the exception of PCR—Chip Allocation). CBC with 3 concepts, either on Mobile or
on PC, did equally well.
Table 2. Hit Rate Analysis
Arrows indicate statistically significant difference from CBC PC
When we adjust these hit rates for the number of concepts presented, discounting the hit rates
for tasks with fewer concepts1, the relative accuracy of the 2 versus the 3 concept tasks shifts.
The adjusted hit rates are shown in Table 3. With adjusted hit rates, the 3 concept task gains the
advantage over 2 concepts. We interpret these unadjusted and adjusted hit rates together to
indicate that, by and large, the 2 and 3 concept tasks generate similar hit rates. Also in the larger
context of comparisons, except for PCR, all of the other techniques are very comparable to CBC
PC and do well.
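A minimal sketch of the adjustment, which simply indexes each raw hit rate against the chance rate for its task (see the footnote below); the hit-rate values here are hypothetical:

```python
def adjusted_hit_rate(hit_rate, n_concepts, include_none=True):
    """Divide a raw hit rate by the random probability of selection for the task,
    i.e., 1 / (number of alternatives shown, counting the none option)."""
    n_alternatives = n_concepts + (1 if include_none else 0)
    return hit_rate / (1.0 / n_alternatives)

# Hypothetical raw hit rates for a 2-concept and a 3-concept task (each with a none option).
print(round(adjusted_hit_rate(0.60, 2), 2))   # 0.60 / 0.3333 -> 1.8
print(round(adjusted_hit_rate(0.55, 3), 2))   # 0.55 / 0.25   -> 2.2
```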
Table 3. Adjusted Hit Rate Analysis
MAE
MAE scores displayed in Table 4 tell a similar story with simplified methods like CBC (2
concepts) and Partial Profile doing better than CBC on personal computer.
Table 4. MAE Analysis
1 We divided the hit rates by the probability of selection of each concept on a screen. E.g., for a task with two concepts and a none option, the random probability of selection will be 33.33%. Therefore, all hit rates obtained for fixed tasks with two concepts were divided by 33.33% to get the index score.
RESPONDENT’S PERSPECTIVE
Average Time
Largely, respondents took more time to evaluate conjoint techniques on the mobile platforms.
As displayed in Chart 1, shortened ACBC takes more time for respondents to evaluate,
particularly for the respondents from India. This is expected due to the rigorous nature of the
method. PCR also took a lot of time to evaluate, especially for the respondents from India. This
might indicate that a certain level of maturity is required from respondents for evaluation of
complex conjoint techniques. This is reflective of the fact that online survey research is still at
a nascent stage in India.
Respondents took the least amount of time to evaluate CBC Mobile (2 concepts) indicating
that respondents can comprehend simpler tasks quicker.
Chart 1. Average time taken (in mins)
Readability
Respondents largely find tasks to be legible on mobile. This might be attributed to the
reduced list of attributes being shown on the screen. Surprisingly, as seen in Chart 2, CBC Mobile (3 concepts) also does well on this front, which suggests that optimizing screen space on mobiles can go a long way toward providing readability.
Chart 2. Readability of methods on PC/Mobile screens
Arrows indicate statistically significant difference from CBC PC
Ease of understanding
Respondents found the concepts presented on the mobile platform easy to understand and the
degree of understanding is comparable to conjoint on personal computers. Thus, conjoint
research can easily be conducted on mobile platforms too.
Chart 3. Readability of methods on PC/Mobile screen
Arrows indicate statistically significant difference from CBC PC
Enjoyability
US respondents found the survey to be significantly less enjoyable than their Indian
counterparts, as displayed in Chart 4. This might be due to the fact that the online survey market in the US is quite saturated compared to the Indian market, which is still nascent. Therefore, respondent
exposure to online surveys might be significantly higher in the US contributing to the low
enjoyability.
Chart 4. Enjoyability of methods on PC/Mobile screen
Arrows indicate statistically significant difference from CBC PC
Encouragement to give honest opinion
Respondents find that all the methods encouraged honest opinions in the survey.
Chart 5. Encouragement to give honest opinions
Arrows indicate statistically significant difference from CBC PC
Realism of tablet configuration
Respondents believe that the tablet configurations are realistic. As seen in Chart 6, all the
methods are more or less at par with CBC on personal computers. This gives us confidence in the
results because the same tablet configuration was used in all techniques.
Chart 6. Realism of tablet configuration
Arrows indicate statistically significant difference from CBC PC
SUMMARY OF RESULTS
On the whole, all of the methods we customized for the mobile platform did very well in
providing good respondent engagement and providing robust data quality.
Although conjoint exercises with 3 concepts perform well on data accuracy parameters, they
do not fare as well on the respondent experience. However, the negative effect on
respondent experience can be mitigated by optimal use of screen space and resplendent
aesthetics.
Our findings indicate that a conjoint exercise with 2 concepts is the best of the alternative
methods we tested in terms of enriching data quality as well as user experience. CBC with 2
concepts performs exceptionally well in providing richer data quality and respondent
engagement than CBC on personal computers. The time taken to complete the exercise is also at
par with that of CBC on PC. PCR (discrete estimation) does fairly well too. However, its
practical application might be debated, with other methods being equally robust, if not more so, and easier to implement.
One may consider lowering the number of attributes being shown on the screen in
conjunction with the reduction of the number of concepts by the usage of partial profile and
shortened ACBC exercise. Although these methods score high on data accuracy parameters,
respondents find them slightly hard to understand as full profile of the products being offered is
not present. However, once respondents cross the barrier of understanding, these methods prove
extremely enjoyable and encourage them to give honest responses. They also take a longer time
to evaluate. Therefore, these should be used in studies where the sole component of the survey
design is the conjoint exercise.
CONCLUSION
This paper shows that researchers can confidently conduct conjoint in mobile surveys.
Respondents enjoy taking conjoint surveys on their mobile, probably due to its “on-the-go” nature.
Researchers might want to adopt simple techniques like screen space optimization and
simplification of tasks in order to conduct conjoint exercises on mobile platforms.
This research also indicates that the data obtained from conjoint on mobile platforms is
robust and mirrors data from personal computers to a certain extent (shown by high correlation
numbers). This research supports the idea that researchers can probably safely group responses
from mobile platforms and personal computers and analyze them without the risk of error.
Chris Diener
CHOICE EXPERIMENTS IN MOBILE WEB ENVIRONMENTS
JOSEPH WHITE
MARITZ RESEARCH
BACKGROUND
Recent years have witnessed the rapid adoption of increasingly mobile computing devices
that can be used to access the internet, such as smartphones and tablets. Along with this increased
adoption we see an increasing proportion of respondents complete our web-based surveys in
mobile environments. Previous research we have conducted suggests that these mobile
responders behave similarly to PC and tablet responders. However, these tests have been limited
to traditional surveys primarily using rating scales and open-ended responses. Discrete choice
experiments may present a limitation for this increasingly mobile respondent base as the added
complexity and visual requirements of such studies may make them infeasible or unreliable for
completion on a smartphone.
The current paper explores this question through two large case studies involving more
complicated choice experiments. Both of our case studies include design spaces based on 8
attributes. In one we present partial and full profile sets of 3 alternatives, and in the second we
push respondents even further by presenting sets of 5 alternatives each. For both case studies we
seek to understand the potential impact of conducting choice experiments in mobile web
environments by investigating differences in parameter estimates, respondent error, and
predictive validity by form factor of survey completion.
CASE STUDY 1: TABLET
Research Design
Our first case study is a web-based survey among tablet owners and intenders. The study was
conducted in May of 2012 and consists of six cells defined by design strategy and form factor.
The design strategies are partial and full profile, and the form factors are PC, tablet, and mobile.
The table below shows the breakdown of completes by cell.
            Partial Profile   Full Profile
PC                201              202
Tablet            183               91
Mobile            163              164
Partial profile respondents were exposed to 16 choice sets with 3 alternatives, and full profile
respondents completed 18 choice sets with 3 alternatives. The design space consisted of 7
attributes with 3 levels and one attribute with 2 levels. Partial profile tasks presented 4 of the 8
attributes in each task. All respondents were given the same set of 6 full profile holdout tasks
after the estimation sets, each with 3 alternatives. Below is a typical full profile choice task.
Please indicate which of the following tablets you would be most likely to purchase.
Attribute | Alternative 1 | Alternative 2 | Alternative 3
Operating System | Apple | Windows | Android
Memory | 8 GB | 64 GB | 16 GB
Included Cloud Storage (additional at extra cost) | 5 GB | 50 GB | None
Price | $199 | $799 | $499
Screen Resolution | High definition display (200 pixels per inch) | Extra-high definition display (300 pixels per inch) | High definition display (200 pixels per inch)
Camera Picture Quality | 5 Megapixels | 0.3 Megapixels | 2 Megapixels
Warranty | 1 Year | 3 Months | 3 Years
Screen Size | 7˝ | 10˝ | 5˝
Analysis
By way of analysis, the Swait-Louviere test (Swait & Louviere, 1993) is used for parameter
and scale equivalence tests on aggregate MNL models estimated in SAS. Sawtooth Software’s
CBC/HB is used for predictive accuracy and error analysis, estimated at the cell level, i.e., design
strategy by form factor. In order to account for demographic differences by device type, data are
weighted by age, education, and gender, with the overall combined distribution being used as the
target to minimize any distortions introduced through weighting.
Partial Profile Results
The Swait-Louviere parameter equivalence test is a sequential test of the joint null hypothesis that scale and parameter vectors of two models are equivalent. In the first step we test for equivalence of parameter estimates allowing scale to vary. If we fail to reject the null hypothesis for this first step then we move to step 2 where we test for scale equivalence.

[Figure: Pairwise partial profile parameter comparisons across PC, Tablet, and Mobile responders. Relative scale parameters ranged from roughly 0.94 to 1.05, parameter agreement (R2) from 0.81 to 0.94, and the hypothesis of parameter equivalence (H1A) was rejected in each pairwise test.]
In each pairwise test we readily reject the null hypothesis that parameters do not differ
beyond scale, indicating that we see significant differences in preferences by device type.
Because we reject the null hypothesis in the first stage of the test we are unable to test for
significant differences in scale. However, the relative scale parameters suggest we are not seeing
dramatic differences in scale by device type.
The chart below shows all three parameter estimates side-by-side. When we look at the
parameter detail, the big differences we see are with brand, price, camera, and screen size. Not
surprisingly, brand is most important for tablet responders with Apple being by far the most
preferred option. It is also not too surprising that mobile responders are the least sensitive to
screen size. This suggests we are capturing differences one would expect to see, which would
result in a more holistic view of the market when the data are combined.
[Chart: Aggregate partial profile part-worth estimates by form factor (PP PC, PP Tablet, PP Mobile) across Brand, Hard Drive, Cloud Storage, Price, Resolution, Camera, Warranty, and Screen Size.]
We used mean absolute error (MAE) and hit rates as measures of predictive accuracy. The
HB parameters were tuned to optimize MAE with respect to the six holdout choice tasks by form
factor and design strategy. The results for the partial profile design strategy are shown in the
table below.
             Base     MAE     Hit Rate
PP PC         201    0.052      0.60
PP Tablet     183    0.058      0.62
PP Mobile     163    0.052      0.63
In terms of hit rate both mobile and tablet responders are marginally better than PC
responders, although not significantly. While tablet responders have a higher MAE, mobile
responders are right in line with PC. At least in terms of in-sample predictive accuracy it appears
that mobile responders are at par with their PC counterparts. We next present out-of-sample
results in the table below.
                           Prediction Utilities
Holdouts      Random       PC      Tablet    Mobile
PC             0.143     0.052     0.082     0.053
Tablet         0.160     0.103     0.058     0.074
Mobile         0.142     0.060     0.077     0.052
Average                  0.082     0.080     0.056
All respondents were presented with the same set of 6 holdout tasks. These tasks were used
for out-of-sample predictive accuracy measures by looking at the MAE of PC responders
predicting tablet responders holdouts, as an example. In the table above the random column
shows the mean absolute deviation from random of the choices to provide a basis from which to
judge relative improvement of the model. The remainder of the table presents MAE when
utilities estimated for the column form factor were used to predict the holdout tasks for the row
form factor. Thus the diagonal is the in-sample MAE and off diagonal is out-of-sample. Finally,
the average row is the average MAE for cross-form factor. For example, the average PC MAE of
0.082 is the average MAE using PC based utilities to predict tablet and mobile holdouts
individually.
Mobile outperforms both tablet and PC in every pairwise out-of-sample comparison. In other
words, mobile is better at predicting PC than tablet and better than PC at predicting tablet
holdouts. This can be seen at both the detail and average level. In fact, Mobile is almost as good
at predicting PC holdouts as PC responders.
Wrapping up the analysis of our partial profile cells we compared the distribution of RLH
statistics output from Sawtooth Software’s CBC/HB to see if there are differences in respondent
error by device type. Note that the range of the RLH statistic is from 0 to 1,000 (three implied
decimal places), with 1,000 representing no respondent error. That is, when RLH is 1,000
choices are completely deterministic and the model explains the respondent’s behavior perfectly.
In the case of triples, a RLH of roughly 333 is what one would expect with completely random
choices where the model adds nothing to explain observed choices. Below we chart the
cumulative RLH distributions for each form factor.
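As a reference point for reading these distributions, RLH (root likelihood) can be computed as the geometric mean of the model's probabilities for each respondent's actual choices, scaled by 1,000; a minimal sketch with made-up probabilities:

```python
import math

def rlh(chosen_probs):
    """Root likelihood: geometric mean of the predicted probabilities of the
    alternatives the respondent actually chose, scaled to the 0-1,000 range."""
    mean_log = sum(math.log(p) for p in chosen_probs) / len(chosen_probs)
    return 1000 * math.exp(mean_log)

# A purely random chooser facing triples has probability 1/3 on every task...
print(round(rlh([1 / 3] * 16)))                # ~333
# ...while a well-fit respondent has high probabilities on the chosen options.
print(round(rlh([0.8, 0.9, 0.7, 0.85] * 4)))   # ~809
```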
[Chart: RLH cumulative distributions for PP PC, PP Tablet, and PP Mobile (cumulative percent plotted against RLH from 0 to 1,000).]
Just as with a probability distribution function, the cumulative distribution function (CDF)
allows us to visually inspect and compare the first few moments of the underlying distribution to
understand any differences in location, scale (variance), or skew. Additionally, the CDF allows us
to directly observe percentiles, thereby quantifying where excess mass may be and how much
mass that is.
The CDF plots above indicate the partial profile design strategy results in virtually identically
distributed respondent error by form factor. As this represents three independent CBC/HB runs
(models were estimated separately by form factor) we are assured that this is indeed not an
aggregate result.
Full Profile Results
Partial profile results suggest mobile web responders are able to reliably complete smaller tasks on either their tablet or smartphone. As we extend this to the full profile strategy the limitations of form factor screen size, especially among smartphone responders, may begin to impact quality of results. We take the same approach to analyzing the full profile strategy as we did with partial profile, first considering parameter and scale equivalence tests.

[Figure: Pairwise full profile parameter comparisons across PC, Tablet, and Mobile responders. Relative scale parameters were again near 1 (roughly 0.94 to 1.05) and the hypothesis of parameter equivalence (H1A) was rejected in each pairwise test, while parameter agreement (R2) dropped relative to partial profile, to roughly 0.55 to 0.81.]
As with the partial profile results, we again see preferences differing significantly beyond
scale. The pairwise tests above show even greater differences in aggregate utility estimates than
with partial profile results, as noted by the sharp decline in parameter agreement as measured by
the R2 fit statistic. However, while we are unable to statistically test for differences in scale we
again see relative scale parameter estimates near 1 suggesting similar levels of error.
[Chart: Aggregate full profile part-worth estimates by form factor (FP PC, FP Tablet, FP Mobile) across Brand, Hard Drive, Cloud Storage, Price, Resolution, Camera, Warranty, and Screen Size.]
Studying the parameter estimate detail in the chart above, we see a similar story as before
with preferences really differing on brand, price, camera, and screen size. And again, the
differences are consistent with what we would expect given the market for tablets, PCs, and
smartphones. For all three device types Apple is the preferred tablet brand which is consistent
with the market leader position they enjoy. Android, not having a real presence (if any) in the PC
market is the least preferred for PC and tablet responders, which is again consistent with market
realities. Android showing a strong second position among mobile responders is again consistent
with market realities as Android is a strong player in the smartphone market. Tablet responders
also being apparently less sensitive to price is also what we would expect given the price
premium of Apple’s iPad.
In-sample predictive accuracy is presented in the table below and we again see hit rates for
tablet and PC responders on par with one another. However, under the full profile design strategy
mobile responders outperform PC responders in terms of both MAE and hit rate, with the latter
being significant with 90% confidence for a one tail test. In terms of MAE tablet responders
outperform both mobile and PC responders. In-sample predictive accuracy of tablet responders
in terms of MAE is most likely the result of brand being such a dominant attribute.
             Base     MAE     Hit Rate
FP PC         202    0.044      0.70
FP Tablet      91    0.028      0.70
FP Mobile     164    0.033      0.76
Looking at out-of-sample predictive accuracy in the table below we see some interesting
results across device type. First off, the random choice MAE is consistent across device types
making the direct comparisons easier in the sense of how well the model improves prediction
over the no information (random) case. The average cross-platform MAE is essentially the same
for mobile and PC responders, again suggesting that mobile responders provide results that are
on par with PC, at least in terms of predictive validity. Interestingly, and somewhat surprisingly,
utilities derived from mobile responders are actually better at predicting PC holdouts than those
derived from PC responders.
                              Holdouts
Prediction
Utilities      PC       Tablet    Mobile    Average
Random         0.163    0.163     0.166
PC             0.044    0.056     0.059     0.058
Tablet         0.101    0.028     0.102     0.102
Mobile         0.038    0.074     0.033     0.056
While PC and mobile responders result in similar predictive ability, tablet responders are
much worse at predicting out-of-sample holdouts. On average tablet responders show almost
twice the out-of-sample error as mobile and PC responders, and compared to in-sample accuracy
tablet out-of-sample has nearly 4 times the amount of error. This is again consistent with a
dominant attribute or screening among tablet responders.
If tablet responder choices are being determined by a dominant attribute then we would
expect to see the mass of the RLH CDF shifted to the right. The cumulative RLH distributions
are shown below for each of the three form factors.
[Chart: RLH cumulative distributions (cumulative percent vs. RLH, 0 to 1,000) for FP PC, FP Tablet and FP Mobile.]
As with partial profile, full profile PC and mobile groups show virtually identical respondent
error. However, we do see noticeably greater error variance among tablet responders, with
greater mass close to the 1,000 RLH mark as well as more around the 300 range. This suggests
that we do indeed see more tablet responders making choices consistent with a single dominating
attribute.
In order to explore the question of dominated choice behavior further, we calculated the
percent of each group who chose in a manner consistent with a non-compensatory or dominant
preference structure. A respondent was classified as choosing according to dominating
preferences if he/she always chose the option with the same level for a specific attribute as the
most preferred. For example, if a respondent always chose the Apple alternative we would say
their choices were determined by the brand Apple. This should be a rare occurrence if people are
making trade-offs as assumed by the model. The results from this analysis are in the table below.
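A minimal sketch of this classification rule appears below (the results table then follows). The data structures and names are assumptions for illustration only: `chosen` maps each respondent to the list of alternatives he or she chose, one dictionary of attribute levels per task.

```python
# A minimal sketch of the dominating-preference classification described above.
# Assumed (illustrative) input: `chosen` maps a respondent id to the list of
# alternatives that respondent chose, one dict of attribute -> level per task.

def dominating_attribute(chosen_alts, attributes):
    """Return an attribute whose level is identical in every chosen alternative,
    or None if the respondent traded off on all attributes."""
    for attr in attributes:
        if len({alt[attr] for alt in chosen_alts}) == 1:
            return attr
    return None

def dominating_share(chosen, attributes):
    """Share of respondents whose choices are consistent with a dominating level."""
    flagged = sum(1 for alts in chosen.values()
                  if dominating_attribute(alts, attributes) is not None)
    return flagged / len(chosen)

# Toy example: r1 always chooses Apple (flagged on Brand); r2 trades off.
chosen = {
    "r1": [{"Brand": "Apple", "Price": "$199"}, {"Brand": "Apple", "Price": "$299"}],
    "r2": [{"Brand": "Apple", "Price": "$199"}, {"Brand": "Android", "Price": "$299"}],
}
print(dominating_share(chosen, ["Brand", "Price"]))   # 0.5
```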
Dominating Preference

              Partial Profile              Full Profile
              PC      Tablet    Mobile     PC      Tablet    Mobile
Base          201     183       163        202     91        164
Number         92      44        47         48     33         29
Percent       0.23    0.24      0.29       0.24    0.36      0.18

P-Values*
              Partial Profile              Full Profile
              Tablet    Mobile             Tablet    Mobile
PC            0.770     0.141              0.027     0.156
Tablet                  0.312                        0.001

* P-Values based on two-tail tests
These results are for dominating preference as described above. In the upper table are the
summary statistics by form factor and design strategy. The “Base” row indicates the number of
respondents in that cell, “Number” those responding in a manner consistent with dominated
preferences, and “Percent” what percent that represents. For example, 23% of PC responders in
the partial profile strategy made choices consistent with dominated preferences. The lower table
presents p-values associated with the pairwise tests of significance between incidences of
dominated choices. Looking at the partial profile responders, this means that the 24% Tablet
versus the 23% PC responders has an associated p-value of 77% indicating the two are not
significantly different from one another.
It should not be surprising that we see no significant differences between form factors for the
partial profile in terms of dominated choices because of the nature of the design strategy. In half
of the partial profile exercises a dominating attribute will not be shown, forcing respondents to
make trade-offs based on attributes lower in their preference structure. However, when we look
at the full profile we do see significant differences by form factor. Tablet responders are much
more likely to exhibit choices consistent with dominated preferences or screening strategies than
either mobile or PC responders who are on par with one another. Differences are both significant
with 95% confidence as indicated by the bold p-values in the lower table.
CASE STUDY II: SIGNIFICANT OTHER
Research Design
The second case study was also a web-based survey, this time among people who were either
in, or interested in being in, a long-term relationship. We again have an eight attribute design,
although we increase the complexity of the experiment by presenting sets of five alternatives per
choice task. A typical task is shown below.
Which of these five significant others do you think is the best for you?
[Example choice task: five significant-other alternatives described on eight attributes (Attractiveness, Romantic/Passionate, Honesty/Loyalty, Funny, Intelligence, Political Views, Religious Views and Annual Income), with a selection option beneath each alternative.]
All attributes other than annual income are 3 level attributes. Annual income is a 5 level
attribute ranging from $15,000 to $200,000. There were two cells in this study, one for
estimation and one to serve as a holdout sample. The estimation design consists of 5 blocks of 12
tasks each. The holdout sample also completed a block of 12 tasks, and both estimation and
holdout samples completed the same three holdout choice tasks. The holdout sample consists of
only PC and tablet responders.
Given the amount of overlap with 5 alternatives in each task, combined with the amount of
space required to present the task, we expect mobile responders to be pushed even harder than
with the tablet study.
Analysis
We again employ aggregate MNL for parameter equivalence tests and Hierarchical Bayes via
Sawtooth Software’s CBC/HB for analysis of respondent error and predictive accuracy.
However, in contrast to the tablet study we did not set individual quotas for device type, so the lack of reasonable balance led us to take a matching approach to analysis rather than simply weighting by demographic profiles. The matching dimensions are listed in the table below.
Matching Dimension      Cells
Design Block            1–5
Age                     18–34, 35+
Gender                  Male/Female
Children in House       Yes/No
Income                  <$50,000, $50,000+
In each comparison with PC, we used simple random sampling to select PC responders
according to the tablet or mobile responder distribution over the above dimensions. For example,
if we had 3 mobile responders who completed block 2, were females between 18 and 34 years
old with children at home and making more than $50,000 per year, we randomly selected 3 PC
responders with the exact same block-demographic profile. The table below shows the
breakdown of completes.
                                            Matched Profiles
           Total     Estimation   Holdout   Tablet    Mobile
PC         1,860     1,378        482       727       771
Tablet        98        73         25         -        39
Mobile        88        88          0        52         -
The total completes were comprised of 1,860 PC, 98 tablet, and 88 mobile responders. Of the
1,860 PC responders, 1,378 were part of the estimation cell and 482 were used for the holdout
sample. In the estimation sample, 727 PC responders had block-demographic profiles matching
at least one tablet responder, and 771 matching at least one mobile responder. The mobile versus
tablet responders comparisons are not presented due to the small number of respondents with
matching profiles.
As previously stated, we used simple random sampling to select a subset of PC responders
with block-demographic profile distributions identical to tablet or mobile, depending on
comparison. Respondents without matching profiles (in either set) were excluded from the
analysis. This process was repeated 1,000 times for parameter equivalence tests using aggregate
MNL, and 100 times for predictive accuracy and respondent error using Sawtooth Software's
CBC/HB.
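As a concrete illustration of the matching step, the sketch below draws one matched PC sample with pandas. The DataFrame, column names and the handling of unmatched cells are assumptions, not the authors' code; cells with no matching PC profile simply contribute nothing, mirroring the exclusion rule above.

```python
# A minimal sketch (assumed data layout, not the authors' code) of drawing PC
# respondents to mirror the target group's block-demographic profile counts.
import pandas as pd

MATCH_COLS = ["design_block", "age_group", "gender", "kids_in_house", "income_group"]

def draw_matched_pc(df, target_device, seed):
    """Sample PC respondents (without replacement) so that their counts match the
    target group's counts in every block-demographic matching cell."""
    target = df[df["device"] == target_device]
    pc = df[df["device"] == "PC"]
    needed = target.groupby(MATCH_COLS).size()      # respondents needed per cell
    samples = []
    for cell, n in needed.items():
        mask = (pc[MATCH_COLS] == pd.Series(cell, index=MATCH_COLS)).all(axis=1)
        pool = pc[mask]
        if len(pool) >= n:                          # cells without matches are skipped
            samples.append(pool.sample(n=n, random_state=seed))
    return pd.concat(samples) if samples else pc.iloc[0:0]

# Repeated draws, e.g., 1,000 times for the aggregate MNL comparisons:
# for i in range(1000):
#     pc_matched = draw_matched_pc(df, "Mobile", seed=i)
#     ... estimate and compare models on pc_matched vs. the mobile group ...
```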
Results
Parameter and scale equivalence tests were performed at each of the 1,000 iterations
described in the analysis section. The charts below summarize the results of the comparison
between mobile and PC responders.
[Figure, three panels: (1) MNL Parameter Comparison, Mobile vs. Average PC parameter estimates (R² = 0.95 overall, R² = 0.84 for the inner attributes); (2) Mobile Relative Scale (m) distribution over 0.5 to 2.0, with 72.4% of estimates > 1; (3) S&L Test Results over 1,000 iterations: fail to reject H1A 19.5%, fail to reject H1B 76.9%.]
In the first panel the average PC parameter estimates are plotted against the mobile parameter
estimates. We see a high degree of alignment overall with an R2 value of 0.95. However, the two
outliers point to the presence of a dominating attribute, so we also present the fit for the inner set
of lesser important attributes, where we still see strong agreement with an R2 value of 0.84,
which is a correlation of over 0.9. The middle panel shows the distribution of the relative scale
parameter estimated in the first step of the Swait-Louviere test with mobile showing slightly
higher scale about 72% of the time.
The right panel above summarizes the test results. Note that if PC and mobile responders were to result in significantly different preferences or scale, we would expect to fail to reject the null hypothesis no more than 5% of the time for tests at the 95% level of confidence. For both H1A (parameters) and H1B (scale) we fail to reject the null hypothesis well in excess of 5% of the time, indicating that we do not see significant differences in preferences or scale between PC and mobile responders. Looking at the detailed parameter estimates in the chart below further reinforces the
similarity of data after controlling for demographics.
[Chart: average part-worth utilities for Avg PC and Mobile across all attribute levels (Attractiveness, Romantic/Passionate, Honesty/Loyalty, Funny, Intelligence, Political Views, Religious Views, Annual Income).]
Comparing tablet and PC parameters and scale we see an even more consistent story. The
results are summarized in the charts below. Even when we look at the consistency between
parameter estimates on the lesser important inner attributes we have an R2 fit statistic of 0.94,
which is a correlation of almost 0.97. Over 90% of the time the relative Tablet scale parameter is
greater than 1, suggesting that we may be seeing slightly less respondent error among those
completing the survey on a tablet. However, as the test results to the right indicate, neither
parameter estimates nor scales differ significantly.
[Figure, three panels: (1) MNL Parameter Comparison, Tablet vs. PC parameter estimates (R² = 0.98 overall, R² = 0.94 for the inner attributes); (2) Tablet Relative Scale (m) distribution over 0.5 to 2.0, with 90.7% of estimates > 1; (3) S&L Test Results over 1,000 iterations: fail to reject H1A 75.1%, fail to reject H1B 51.8%.]
Test results for mobile versus tablet also showed no significant differences in preferences or scale. As noted earlier, though, those results are not presented because of the small available sample sizes and because the story is sufficiently similar that it would not add meaningfully to the discussion.
Turning to in-sample predictive accuracy, holdout MAE and hit rates are presented in the
table below.
           Base    MAE      Hit Rate    PC Matched MAE*    PC Matched Hit Rate*
Tablet      73     0.050    0.53        0.039              0.53
Mobile      88     0.037    0.53        0.034              0.54

* Mean after 100 iterations
In the table, the first MAE and Hit Rate columns refer to the results for the row form factor
responders. For example, among tablet responders the in-sample holdout MAE is 0.050 and hit
rate is 53%. The PC Matched MAE and Hit Rate refer to the average MAE and hit rate over the
iterations matching PC completes to row form factor responders. In this case, the average MAE
for PC responders matched to tablet is 0.039, with a mean hit rate of 53%. Controlling for
demographic composition and sample sizes brings all three very much in line with one another in
terms of in-sample predictive accuracy, although tablet responders appear to be the least
consistent internally with respect to MAE.
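For reference, the sketch below shows one common way to compute these two measures from individual utilities for a single holdout task. The array shapes, names and toy values are assumptions, not the authors' code; in practice the measures are averaged over all holdout tasks.

```python
# A minimal sketch (assumed inputs) of holdout hit rate and MAE for one task:
# `util` is an (n_respondents x n_alternatives) array of total utilities and
# `choice` holds the index of the alternative each respondent actually chose.
import numpy as np

def hit_rate(util, choice):
    """Share of respondents whose highest-utility alternative is the one chosen."""
    return np.mean(np.argmax(util, axis=1) == choice)

def mae(util, choice):
    """Mean absolute error between predicted (logit) shares and observed shares."""
    expu = np.exp(util - util.max(axis=1, keepdims=True))    # numerically stabilized logit
    pred_share = (expu / expu.sum(axis=1, keepdims=True)).mean(axis=0)
    obs_share = np.bincount(choice, minlength=util.shape[1]) / len(choice)
    return np.abs(pred_share - obs_share).mean()

# Toy example with 3 respondents and 3 alternatives:
util = np.array([[1.2, 0.3, -0.5],
                 [0.1, 0.8,  0.4],
                 [0.9, 0.2,  0.6]])
choice = np.array([0, 1, 2])
print(hit_rate(util, choice), round(mae(util, choice), 3))
```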
Out-of-sample predictive accuracy shows a similar story for mobile compared to PC
responders. Once we control for sample size differences and demographic distributions, mobile
and PC responders have virtually the same out-of-sample MAE. PC responders when matched to
tablet did show a slightly higher out-of-sample MAE than actual tablet responders, although we
do not conclude this to be a substantial strike against the form factor. Out-of-sample results are
summarized below.
           MAE      PC Matched MAE*
Tablet     0.050    0.039
Mobile     0.037    0.034

* Mean after 100 iterations
The results thus far indicate that mobile and tablet responders provide data that is at least on
par with PC responders in terms of preferences, scale, and predictive accuracy. To wrap up the
results of our significant other case study we look at respondent error as demonstrated with the
RLH cumulative distributions in the chart below.
[Chart: RLH cumulative distributions (cumulative percent vs. RLH, 0 to 1,000) for Mobile, Tablet, PC matched to Mobile, and PC matched to Tablet.]
We again see highly similar distributions of RLH by form factor. The PC matched cumulative
distribution curves are based on data from all 100 iterations, which explains the relatively smooth
shape of the distribution. There is possibly a slight indication that there is less error among the
mobile and tablet responders, although we do not view this as substantially different. The slight
shift of mass to the right is consistent with relative scale estimates in our parameter equivalence
tests, which were not statistically significant.
CONCLUSION
In both our tablet and significant other studies we see similar results regardless of which
form factor the respondent chose to complete the survey. However, the tablet study does indicate
the potential for capturing differing preferences by device type of survey completion. Given the
context of that study, this finding is not at all surprising, and in fact is encouraging in that we are
capturing more of the heterogeneity in preferences we would expect to see in the marketplace. It
would be odd if tablet owners did not exhibit different preferences than non-owners, given their
experience using the devices. On the other hand, we observe the same preferences regardless of form
factor in the significant other study, which is what we would expect for a non-technical topic
unrelated to survey device type.
More important than preference structures, which we should not expect to converge a
priori, both of our studies indicate that the quality of data collected via smartphone is on par
with, or even slightly better than, that collected from PC responders. In terms of predictive
accuracy, both in and out-of-sample, and respondent error, we can be every bit as confident in
choice experiments completed in a mobile environment as in a traditional PC environment.
Responders who choose to complete surveys in a mobile environment are able to do so reliably,
and we should therefore not exclude them from choice experiments based on assumptions to the
contrary. In light of the potential for capturing different segments in terms of preferences,
we should actually welcome the increased diversity offered by presenting choice experiments in
different web environments.
Joseph White
REFERENCES
Swait, J., & Louviere, J. (1993). The Role of the Scale Parameter in the Estimation and
Comparison of Multinomial Logit Models. Journal of Marketing Research, 30(3), 305–314.
USING COMPLEX MODELS TO DRIVE BUSINESS DECISIONS
KAREN FULLER
HOMEAWAY, INC.
KAREN BUROS
RADIUS GLOBAL MARKET RESEARCH
ABSTRACT
HomeAway offers an online marketplace for vacation travelers to find rental properties.
Vacation home owners and property managers list rental property on one or more of
HomeAway’s websites. The challenge for HomeAway was to design the pricing structure and
listing options to better support the needs of owners and to create a better experience for
travelers. Ideally, this would also increase revenues per listing. They developed an online
questionnaire that looked exactly like the three pages vacation homeowners use to choose the
options for their listing(s). This process nearly replicated HomeAway’s existing enrollment
process (so much so that some respondents got confused regarding whether they had completed a
survey or done the real thing). Nearly 2,500 US-based respondents completed multiple listings
(MBC tasks), where the options and pricing varied from task to task. Later, a similar study was
conducted in Europe. CBC software was used to generate the experimental design, the
questionnaire was custom-built, and the data were analyzed using MBC (Menu-Based Choice)
software. The results led to specific recommendations for management, including the use of a
tiered pricing structure, additional options, and an increase in the base annual subscription price.
After implementing many of the suggestions of the model, HomeAway has experienced greater
revenues per listing and the highest renewal rates involving customers choosing the tiered
pricing.
THE BUSINESS ISSUES
HomeAway Inc., located in Austin Texas, is the world’s largest marketplace for vacation
home rentals. HomeAway sites represent over 775,000 paid listings for vacation rental homes in
171 countries. Many of these sites recently merged under the HomeAway corporate name. For
this reason, subscription configurations could differ markedly from site-to-site.
Vacation home owners and property managers list their rental properties on one or more
HomeAway sites for an annual fee. The listing typically includes details about the size and
location of the property, photos of the home, a map, availability calendar and occasionally a
video. Travelers desiring to rent a home scan the listings in their desired area, choose a home,
contact and rent directly from the owner and do not pay a fee to HomeAway. HomeAway’s
revenues are derived solely from owner subscriptions. Owners and property managers have a
desire to enhance their “search position,” ranking higher in the available listings, to attract
greater rental income.
HomeAway desired to create a more uniform approach for listing properties across its
websites, enhance the value and ease of listing for the owner, and encourage owners to provide
high quality listings while creating additional revenue.
THE BUSINESS ISSUE
The initial study was undertaken in the US for sites under the names HomeAway.com and
VRBO.com. The HomeAway.com annual subscription included a thumbnail photo next to the
property listing, 12 photos of the property, a map, availability calendar and a video. Owners
could upload additional photos if desired. The search position within the listings was determined
by an algorithm rating the “quality” of the listing. The VRBO.com annual subscription included
four photos. Owners could pay an additional fee to show more photos which would move their
property up in the search results. With the purchase of additional photos came enhancements
such as a thumbnail photo, map and video.
The business decision entailed evaluating an alternative tiered pricing system tied to the
position on the search results (e.g., Bronze, Silver and Gold) versus alternative tiered systems
based on numbers of photos.
THE STUDY DESIGN
The designed study required 15 attributes arrayed as follows using an alternative-specific
design through Sawtooth Software’s CBC design module:
 Five alternative "Basic Listing" options:
o Current offer based on photos
o Basic offer includes fewer photos with the ability to pay for extra photos and obtain "freebies" (e.g., thumbnail photo) and improve search position
o Basic offer includes fewer photos and includes "freebies" (e.g., thumbnail photo). The owner can "buy up" additional photos to improve search position
o Basic offer includes many photos but no "freebies." Pay directly for specific search position and obtain "freebies."
o Basic offer includes many photos and "freebies." Pay directly for specific search position.
 Pricing for Basic Offers—five alternatives specific to the "Basic Listing"
 "Buy Up" Tiers offered specific to Basic Listing offer—3, 7 and 11 tiers
 Tier prices—3 levels under each approach
 Options to list on additional HomeAway sites (US only, Worldwide, US/Europe/Worldwide options)
 Prices to list on additional sites—3 price levels specific to option
 Other listing options (Directories and others)
THE EXERCISE
Owners desiring to list a home on the HomeAway site select options they wish to purchase
on a series of three screens. For this study the screens were replicated to closely resemble the
look and functionality of the sign-up procedure on the website. These screens are shown in the
Appendix.
Additionally, the search position shown under the alternative offers was customized to the
specific market where the rental home was located. In smaller markets “buying up” might put the
home in the 10th position out of 50 listings; in other larger markets the same price might only list
the home 100th out of 500 listings. As the respondent moved from one screen to the next the
“total spend” was shown. The respondent always had the option to return to a prior screen and
change the response until the full sequence of three screens was complete. Respondents
completed eight tasks of three screens.
THE INTERVIEW AND SAMPLE
The study was conducted through an online interview in the US in late 2010/early 2011
among current and potential subscribers to the HomeAway service. The full interview ran an
average of 25 minutes.
 903 current HomeAway.com subscribers
 970 current VRBO.com subscribers
 500 prospective subscribers who rent or intend to rent a home to vacationers and do not list on a HomeAway site
Prospective subscribers were recruited from an online panel.
THE DATA
Most critical to the usefulness of the results is an assurance that the responses are realistic,
that respondents were not overly fatigued and were engaged in the process.
To this end, the median and average “spend” per task are examined and shown in the table
below:
These results resemble closely the actual spend among current subscribers. Additionally,
spend by task did not differ markedly. A large increase/decrease in spend in the early/later tasks
might indicate a problem in understanding or responding to the tasks.
The utility values for each of the attribute levels were estimated using Sawtooth Software’s
HB estimation. Meaningful interactions were embedded into the design. Additional cross-effects
were evaluated.
To further evaluate data integrity HB results were run for all eight tasks in total, the first six
tasks in total and the last six tasks in total.
The results of this exercise indicate that using all eight tasks was viable. Results did not
differ in a meaningful way when beginning or ending tasks were dropped from the data runs. The
results for several of the attributes are shown in the following charts.
THE DECISION CRITERIA
Two key measures were generated for use by HomeAway in their financial models to
implement a pricing strategy—a revenue index and a score representing the appeal of the offer to
homeowners.
These measures were generated in calculations in an Excel-based simulator, an example of
which is shown below:
[Screenshot: Excel-based simulator. For a specified offer configuration the simulator reports the appeal score of the option (e.g., 84.1 for the total exercise sample, N = 1,470) and the revenue index (e.g., 115.0 for the total sample), along with simulated take rates for each photo/price tier ($30 to $330 above the base price), featured listings (1, 3, 6 or 12 months at $49 to $199), the Featured, Golf and Ski directories ($59 for 12 months) and a $20-per-week special offer, broken out by market size (Total, Small, Medium, Large, Extra Large).]
In this simulator, the user can specify the availability of options for the homeowner, pricing
specific to each option and the group of respondents to be studied.
The appeal measures are indicative of the interest level for that option among homeowners in
comparison to the current offer. The revenue index is a relative measure indicating the degree to
which the option studied might generate revenue beyond the “current” offer (Index = 100). The
ideal offer would generate the highest appeal while maximizing revenue. The results in the
simulator were weighted to reflect the proportion of single and dual site subscribers and potential
prospects (new listings) according to current property counts.
BUSINESS RECOMMENDATIONS
The decision whether to move away from pricing based on the purchase of photos in a listing
to an approach based on direct purchase of a listing “tier” was critical for the sites studied as well
as other HomeAway sites. Based on these results, HomeAway chose to move to a tiered pricing
approach. Both approaches held appeal to homeowners but the tiered approach generated the
greater upside revenue potential. Additional research also indicated that the tiered system
provided a better traveler experience in navigating the site.
While the study evaluated three, seven and eleven tier approaches, HomeAway chose a five
tier approach (Classic, Bronze, Silver, Gold and Platinum). In general, offers with fewer tiers outperformed offers with more tiers. The choice of five tiers gave HomeAway greater flexibility in its offer.
Price tiers were implemented in market at $349 (Classic); $449 (Bronze); $599 (Silver); $749
(Gold) and $999 (Platinum). Each contained “value-added” features bundled in the offer to allow
for greater price flexibility. These represent a substantive increase in the base annual subscription
prices.
HomeAway continued to offer cross-sell options and additional listing offers (feature
directories, feature listings and other special offers) to generate additional revenue beyond the
base listing.
SOME LESSONS LEARNED FOR FUTURE MENU-BASED STUDIES
Substantial “research” learning was also generated through this early foray into menu-based
choice models.
 We believe that one of the keys to the success of this study was the “strive for realism” in
the presentation of the options to respondents. (The task was sufficiently realistic that
HomeAway received numerous phone calls from its subscribers asking why their
“choices” in the task had not appeared in their listings.) Realism was implemented not
only in the “look” of the pages but also in the explanations of the listing positions
calculated based on their own listed homes.
 Also critical to success of any menu-based study is the need to strike a “happy medium”
in terms of number of variables studied and overall sample size.
o While the flexibility of the approach makes it tempting for the researcher to "include everything," parsimony pays. Both the task and the analysis can be overwhelming when variables that are not critical to the decision are included. Estimation of cross-effects is also challenging: too many cross-effects quickly lead to model over-specification, resulting in cross-cancellation of the needed estimates.
o Sufficient sample size is likewise critical, but too much sample can also be detrimental. In the study design keep in mind that many "sub-models" are estimated and the sample must be sufficient to allow stable estimation at the individual level. Too much sample, however, presents major challenges to computing power and ultimately simulation.
 In simulation it is important to measure from a baseline estimate. This is a research exercise, with awareness and other marketing measures not adequately represented. Measurement from a baseline levels the playing field for these factors with little known effect, providing confidence in the business decisions made. This is still survey research, and we expect a degree of over-statement by respondents. Using a "baseline" provides consistency to the over-statement.
IN-MARKET EXPERIENCE
HomeAway implemented the recommendations in market for its HomeAway and VRBO
businesses. Subsequent to the effort, the study was repeated, in a modified form, for HomeAway
European sites.
In-market adoption of the tiered system exceeded the model predictions. Average revenue per
listing increased by roughly 15% over the prior year. Additionally, HomeAway experienced the
highest renewal rates among subscribers adopting the tiered system.
Brian Sharples, Co-founder and Chief Executive Officer noted:
“The tiered pricing research allowed HomeAway to confidently launch
tiered pricing to hundreds of thousands of customers in the US and
European markets. Our experience in market has been remarkably close
to what the research predicted, which was that there would be strong
demand for tiered pricing among customers. Not only were we able to
provide extra value for our customers but we also generated substantial
additional revenue for our business.”
Karen Fuller
Karen Buros
APPENDIX—SCREEN SHOTS FROM THE SURVEY
AUGMENTING DISCRETE CHOICE DATA—A Q-SORT CASE STUDY
BRENT FULLER
MATT MADDEN
MICHAEL SMITH
THE MODELLERS
ABSTRACT
There are many ways to handle conjoint attributes with many levels including progressive
build tasks, partial profile tasks, tournament tasks and adaptive approaches. When only one
attribute has many levels, an additional method is to augment choice data with data from other
parts of the survey. We show how this can be accomplished with a standard discrete choice task
and a Q-Sort exercise.
PROBLEM DEFINITION AND PROPOSED SOLUTION
Often, clients come to us with an attribute grid that has one attribute with a large number of
levels. Common examples include promotions or messaging attributes. Having too many levels
in an attribute can lead to excessive respondent burden, insufficient level exposure, and non-intuitive results or reversals. To help solve the issue we can augment discrete choice data with
other survey data focused on that attribute. Sources for augmentation could include MaxDiff
exercises, Q-Sort and other ranking exercises, rating batteries and other stated preference
questions. Modeling both sets of data together allows us to get more information and better
estimates for the levels of the large attribute. We hope to find the following in our augmented
discrete choice studies:
1st priority—Best estimates of true preference
2nd priority—Better fit with an external comparison
3rd priority—Better holdout hit rates and lower holdout MAEs
Approaches like this are fairly well documented. In 2007 Hendrix and Drucker showed how
data augmentation can be used on a MaxDiff exercise with a large number of items. Rankings
data from a Q-Sort task were added to the MaxDiff information and used to improve the final
model estimates. In another paper in 2009 Lattery showed us how incorporating stated
preference data as synthetic scenarios in a conjoint study can improve estimation of individual
utilities; higher hit rates and more consistent utilities resulted. Our augmenting approach is
similar and we present two separate case studies below. Both augment discrete choice data with
synthetic scenarios. The first case study augments a single attribute that has a large number of
levels using data from a Q-Sort exercise about that attribute. The second case study augments
several binary (included/excluded) attributes with data from a separate scale rating battery of
questions.
CASE STUDY 1 STRUCTURE
We conducted a telecom study with a discrete choice task trading off attributes such as
service offering, monthly price, additional fees and contract type. The problematic attribute listed
promotion gifts that purchasers would receive for free when signing up for the service. This
attribute had 19 levels that the client wanted to test. We were concerned that the experimental
design would not give sufficient coverage to all the levels of the promotion attribute and that the
discrete choice model would yield nonsensical results. We know from experience that ten levels
for one attribute is about the limit of what a respondent can realistically handle in a discrete
choice exercise. The final augmented list is shown in Table 1.
Table 1. Case Study 1 Augmented Promotion Attribute Levels
Augmentation List
$100 Gift Card
$300 Gift Card
$500 Gift Card
E-reader
Gaming Console 1
Tablet 1
Mini Tablet
Tablet 2
Medium Screen TV
Small Screen TV
Gaming Console 2
HD Headphones
Headphones
Home Theatre Speakers
3D Blu-Ray Player
12 month Gaming Subscription
We built a standard choice task (four alternatives and 12 scenarios) with all attributes. Later
in the survey respondents were asked a Q-Sort exercise with levels from the free promotion gift
attribute. Our Q-Sort exercise included the questions below to obtain a multi-step ranking. These
ranking questions took one and a half to two minutes for respondents to complete.
1) Which of the following gifts is most appealing to you?
2) Of the remaining gifts, please select the next 3 which are most appealing to you.
(Select 3 gifts)
3) Of the remaining gifts, which is the least appealing to you? (Select one)
4) Finally, of the remaining gifts, please select the 3 which are least appealing to you
(Select 3 gifts)
In this way we were able to obtain promotion gift ranks for each respondent. We coded the
Q-Sort choices into a discrete choice data framework as a series of separate choices and
appended these as extra scenarios within the standard discrete choice data. Based on ranking
comparisons the item chosen as the top rank was coded to be chosen compared to all others. The
items chosen as “next top three” each beat all the remaining items. The bottom three each beat
the last ranked item in pairwise scenarios. We estimated two models to begin with, one standard
discrete choice model and a discrete choice with the additional Q-Sort scenarios.
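A minimal sketch of that coding logic is shown below. The item identifiers, example ranks and the (winner, losers) output format are illustrative assumptions, not the authors' code; in the actual data each pair becomes additional rows in the choice design matrix, as illustrated in the Appendix.

```python
# A minimal sketch (illustrative, not the authors' code) of converting the
# Q-Sort ranks into synthetic choice scenarios using the rules above.

def qsort_scenarios(top1, next3, bottom3, last1, all_items):
    """Return (winner, losers) pairs implied by the Q-Sort ranking questions."""
    remaining = [i for i in all_items if i != top1]
    scenarios = [(top1, remaining)]                     # top item beats all others
    mid = [i for i in remaining if i not in next3]
    scenarios += [(item, mid) for item in next3]        # next three each beat the remaining items
    scenarios += [(item, [last1]) for item in bottom3]  # bottom three each beat the last-ranked item
    return scenarios

# Hypothetical ranks for one respondent over 16 promotion gifts (ids 1-16):
scen = qsort_scenarios(top1=15, next3=[2, 5, 9], bottom3=[3, 8, 11], last1=16,
                       all_items=list(range(1, 17)))
print(len(scen))   # 7 synthetic scenarios appended to this respondent's choice data
```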
CASE STUDY 1 RESULTS
As expected, the standard discrete choice model without the Q-Sort augmentation yielded
nonsensical results for the promotion gift attribute. Some of the promotions we tested included
prepaid gift cards. As seen in Table 2, before integrating the Q-Sort data, we saw odd reversals,
for example, the $100 and $300 prepaid cards were preferred over the $500 card on many of the
individual level estimates. When the Q-Sort augment was applied to the model the reversals
disappeared almost completely. The prepaid card ordering was logical (the $500 card was most
preferred) and rank-ordering made sense for other items in the list.
Table 2. Case Study 1 Summary of Individual Level Reversals
Individual Reversals    DCM Only    DCM + Q-Sort
$100 > $300             59.8%       0.0%
$300 > $500             60.8%       0.8%
$100 > $500             82.3%       0.0%
As a second validation, we assigned approximate MSRP figures to the promotions, figuring
they would line up fairly well with preferences. As seen in Figure 1, when plotting the DCM
utilities against the MSRP values, the original model had a 29% r-square. After integrating the Q-Sort data, we saw the r-square increase to 58%. Most ill-fitting results were due to premium
offerings, where respondents likely would not see MSRP as a good indicator of value.
Figure 1. Case Study 1 Comparison of Average Utilities and MSRP
Priorities one and two mentioned above seem to be met in this case study. The augmented
model gave us estimates which we believe are closer to true preference, and our
augmented model better matches the external check to MSRP. The third priority of getting
improved hit rates and MAEs with the augmented model proved more elusive with this data set.
The augmented model did not significantly improve holdout hit rates or MAE (see Table 4). This was a somewhat puzzling result. One explanation is that shares were very low in this study, in the 1% to 5% range. The promotion attribute does not add much additional predictive value because the hit rate is already very high and has very little room for improvement. As a validation of this theory
we estimated a model without the promotion attribute, completely dropping it from the model.
We were still able to obtain 93% holdout hit rates in this model confirming that the promotion
attribute did not add any predictive power to our model. In a discrete choice model like this,
people’s choices might be largely driven by the top few attributes, yet when the market offerings
tie on those attributes, then mid-level attributes (like promotions in this study) will matter more.
Our main goal in this study was to get a more sensible and stable read on the promotion attribute
and not explicitly to improve the model hit rates. We are also not disappointed that the hit rate and MAE did not improve, because both models showed a high degree of accuracy.
Table 3. Case Study 1 Promotion Rank Orderings, Models, MSRP, Q-Sort
Promotion                        MSRP Rank    DCM Only Rank    DCM + Q-Sort Rank    Q-Sort Rank
$500 Gift Card                    1            5                1                    1
Tablet 1                          2            2                3                    3
Home Theatre Speakers             3            7                7                    7
Mini Tablet                       4            1                5                    5
$300 Gift Card                    5            4                2                    2
Medium Screen TV                  6            9                4                    4
Gaming Console 2                  7            15               14                   12
E-reader                          8            12               9                    8
Gaming Console 1                  8            13               12                   11
Tablet 2                          8            6                8                    9
Small Screen TV                   8            8                13                   13
HD Headphones                     8            14               15                   15
3D Blu-Ray Player                 8            11               10                   14
Headphones                        14           10               11                   10
$100 Gift Card                    15           3                6                    6
12 month Gaming Subscription      16           16               16                   16
Table 4. Case Study 1 Comparison of Hit Rates, MAE, and Importances
                      DCM Only    DCM + Q-Sort
Holdout hit rate      93.4%       93.3%
MAE                   0.0243      0.0253
Average Importance    7.3%        14.5%
One problem with augmenting discrete choice data is that it often will artificially inflate the
importance of the augmented attribute relative to the non-augmented attributes. Our solution to
this problem was to scale back the importances to the original un-augmented model importance
at the individual level. We feel that there are also some other possible solutions that could be
investigated further. For example, we could apply a scaling parameter to the augmented attribute, choosing the scale factor to minimize the MAE or maximize the hit rate.
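One possible way to implement the rescaling we used (a sketch under our own assumptions, not the authors' procedure) is to compute each attribute's importance as its utility range over the total range and then multiply the augmented attribute's part-worths by the factor that returns its importance to the un-augmented target for that respondent:

```python
# A sketch (not the authors' code) of shrinking one attribute's part-worths so
# its importance matches a target, e.g., the importance from the un-augmented
# model for the same respondent. `utils` maps attribute -> array of part-worths.
import numpy as np

def importance(utils):
    """Attribute importances: each attribute's utility range over the total range."""
    ranges = {a: np.ptp(np.asarray(v, dtype=float)) for a, v in utils.items()}
    total = sum(ranges.values())
    return {a: r / total for a, r in ranges.items()}

def rescale_attribute(aug_utils, target_importance, attribute):
    """Scale one attribute's part-worths so its importance equals the target (0 < target < 1)."""
    utils = {a: np.asarray(v, dtype=float) for a, v in aug_utils.items()}
    other_range = sum(np.ptp(v) for a, v in utils.items() if a != attribute)
    # Solve range_new / (range_new + other_range) = target_importance for range_new.
    new_range = target_importance * other_range / (1.0 - target_importance)
    utils[attribute] = utils[attribute] * (new_range / np.ptp(utils[attribute]))
    return utils

# Usage idea: target = importance(unaugmented_utils)["promotion"] per respondent,
# then rescale_attribute(augmented_utils, target, "promotion").
```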
An alternative to augmentation for completely removing all reversals is to constrain the estimates at the respondent level using MSRP information. Our constrained model maintained the 93% holdout
hit rates and comparable levels of MAE. The constrained model also deflated the average
importance of the promotion attribute to 4.8%. We thought it was better to augment in this case
study since there are additional trade-offs to be considered besides MSRP. For example, a
respondent might value a tablet at $300 but might prefer a product with a lower MSRP because
they already own a tablet.
CASE STUDY 2 STRUCTURE
We conducted a second case study which was also a discrete choice model in the telecom
space. The attributes included 16 distinct features that had binary (included/excluded) levels.
Other attributes in the study also included annual and monthly fees. After the choice task,
respondents were asked about their interest in using each feature (separately) on a 1–10 rating
scale. This external task took up to 2 minutes for respondents to complete. If respondents
answered 9 or 10 then we added extra scenarios to the regular choice scenarios. Each of these
extra scenarios was a binary comparison, each item vs. a “none” alternative.
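A minimal sketch of that augmentation rule is shown below; the data structure (one 1-10 rating per feature for a single respondent) and the output format are illustrative assumptions rather than the authors' code.

```python
# A minimal sketch of the Case Study 2 augmentation: every feature rated 9 or 10
# generates an extra binary task (feature vs. "none") with the feature inferred
# as chosen. Names and structures are illustrative only.

def augmentation_tasks(ratings, threshold=9):
    tasks = []
    for feature, score in ratings.items():
        if score >= threshold:
            tasks.append({
                "alternatives": [feature, "none"],
                "chosen": feature,        # inferred choice for the synthetic task
            })
    return tasks

print(augmentation_tasks({"Feature 1": 10, "Feature 2": 7, "Feature 3": 9}))
# -> two synthetic tasks, for Feature 1 and Feature 3
```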
CASE STUDY 2 RESULTS
As expected, the rank ordering from the augmented model aligned slightly better with what we
would intuitively expect (rank orders shown in Table 5). As far as gauging against an external source,
this case study was a little bit more difficult than the previous one because we could not assign
something as straightforward as MSRP to the features. We looked at the top rated security
software products (published by a security software review company) and counted the number of
times the features were included in each and ranked them. Figure 2 shows this comparison. Here
we feel the need to emphasize key reasons not to constrain to an external source. First, often it is
very difficult to find external sources. Second, if an external source is found, it can be difficult to
validate and have confidence in. Last, even if there is a valid external source such as MSRP, it
still might not make sense to constrain given that there could be other value tradeoffs to consider.
Similar to the first case study, we did not see improved hit rates or MAEs in the second case
study. Holdout hit rates came out at 98% for both augmented and un-augmented models, and
MAEs were not statistically different from each other. We are not concerned with this non-improvement because of the high degree of accuracy of both the augmented and non-augmented
models.
Table 5. Case Study 2 Rank Orders
Feature       Stated Rank Order    DCM Only Rank Order    DCM + Stated Rank Order
Feature 1     1                    1                      1
Feature 2     2                    3                      3
Feature 3     3                    2                      2
Feature 4     4                    5                      4
Feature 5     5                    4                      5
Feature 6     6                    9                      7
Feature 7     7                    11                     9
Feature 8     8                    15                     10
Feature 9     9                    6                      6
Feature 10    10                   8                      8
Feature 11    11                   13                     14
Feature 12    12                   10                     13
Feature 13    13                   12                     12
Feature 14    14                   7                      11
Feature 15    15                   14                     15
Feature 16    16                   16                     16
Figure 2. Case Study 2 External Comparison
DISCUSSION AND CONCLUSIONS
Augmenting a choice model with a Q-Sort or ratings battery can improve the model in the
following ways. First, the utility values are more logical and fit better with the respondents’ true
values for the attribute levels. Second, the utility values have a better fit with external sources of
value. It is not a given that holdout hit rates and MAE are improved with augmentation, although
we would hope that they would be in most conditions. We feel that our hit rates and MAE did not
improve in these cases because of the low likelihood of choice in the products we studied and the
already high pre-augmentation hit rates.
There are tradeoffs to consider when deciding to augment or constrain models. First, there is
added respondent burden in asking the additional Q-Sort or other exercise used for augmentation.
In our cases the extra information was collected in less than two additional minutes. Second,
there is additional modeling and analysis time spent to integrate the augmentation. In our cases
the augmented HB models took 15% longer to converge. Third, there is a tendency for the
attribute that is augmented to have inflated importances or sensitivities and we suggest scaling
the importances by either minimizing MAE or using the un-augmented importances. Lastly, one
should consider the reliability of any external sources used to check the augmentation against or to use for
constraining.
Brent Fuller
Michael Smith
APPENDIX
Figure 3 shows the un-augmented coding matrix from the first case study, and Figure 4 shows an example with the appended augmented scenarios. In scenario 13, item 15 was chosen as the highest-ranking item from the Q-Sort exercise. All other attributes for the augmented tasks are coded as 0.
Figure 3. Example of un-augmented Coding matrix
[Coding matrix excerpt: columns for Scenario, Alternative, the choice indicator y, the effects-coded attributes tv_2–tv_6, and the promotion dummies promo_1–promo_19; rows for scenarios 1 through 12, each with four alternatives.]
Figure 4. Example of Augmented Coding Matrix
[Coding matrix excerpt: the same columns as Figure 3, with the appended Q-Sort scenarios (scenario 13 shown). In the augmented rows all tv_ attribute columns are coded 0, each alternative carries a single promotion dummy, and y = 1 for alternative 15, the item ranked highest in the Q-Sort.]
REFERENCES
Hendrix, Phil and Drucker, Stuart (2007), “Alternative Approaches to MaxDiff with Large Sets
of Disparate Items—Augmented and Tailored MaxDiff," 2007 Sawtooth Software Conference
Proceedings, 169–187.
Lattery, Kevin (2009), “Coupling Stated Preferences with Conjoint Tasks to Better Estimate
Individual-Level Utilities," 2009 Sawtooth Software Conference Proceedings, 171–184.
MAXDIFF AUGMENTATION: EFFORT VS. IMPACT
URSZULA JONES
TNS
JING YEH
MILLWARD BROWN
BACKGROUND
In recent years MaxDiff has become a household name in marketing research as it is more
and more commonly used to assess the relative performance of various statements, products, or
messages. As MaxDiff grows in popularity, it is often called upon to test a large number of items;
requiring lengthier surveys in the form of more choice tasks per respondent in order to maintain
predictive accuracy. Oftentimes MaxDiff scores are used as inputs to additional analyses (e.g.,
TURF or segmentation), therefore a high level of accuracy for both best/top and worst/bottom
attributes is a must. Based on standard rules of thumb for obtaining stable individual-level
estimates, the number of choice tasks per respondent becomes very large as the number of items
to be tested increases. For example, 40 items requires 24–30 choice tasks per respondent
(assuming 4–5 items per task).
Yet at the same time our industry, and society in general, is moving at a faster pace with decreasing attention spans, necessitating shorter surveys to maintain respondent engagement. Data quality suffers at 10 to 15 CBC choice tasks per respondent (Tang and Grenville 2010). Researchers therefore find themselves pulled by two opposing demands—
the desire to test larger sets of items and the desire for shorter surveys—and faced with the
consequent challenge of balancing predictive accuracy and respondent fatigue.
To accommodate such situations, researchers have developed various analytic options, some
of which were evaluated in Dr. Ralph Wirth and Anette Wolfrath's award-winning paper "Using
MaxDiff for Evaluating Very Large Sets of Items: Introduction and Simulation-Based Analysis of
a New Approach.” In Express MaxDiff each respondent only evaluates a subset of the larger list
of items based on a blocked design and the analysis leverages HB modeling (Wirth and Wolfrath
2012). In Sparse MaxDiff each respondent sees each item less than the rule of thumb of 3 times
(Wirth and Wolfrath 2012). In Augmented MaxDiff, MaxDiff is supplemented by Q-Sort
informed phantom MaxDiff tasks (Hendrix and Drucker 2007).
Augmented MaxDiff was shown to have the best predictive power, but comes at the price of
significantly longer questionnaires and complex programming requirements (Wirth and Wolfrath
2012). Thus, these questions still remained:
1. Given the complex programming and additional questionnaire time, is Augmented
MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job?
2. If augmentation is valuable, how much is needed?
3. Could augmentation be done using only “best” items or should “worst” items also be
included?
CASE STUDY AND AUGMENTATION PROCESS
To answer these questions regarding Augmented MaxDiff, we used a study of N=676
consumers with chronic pain. The study objectives were to determine the most motivating
messages as well as the combination of messages that had the most reach.
Augmented MaxDiff marries MaxDiff with Q-Sort. Respondents first go through the
MaxDiff section per usual, completing the choice tasks as determined by an experimental design.
Afterwards, respondents complete the Q-Sort questions. The Q-Sort questions allow researchers
to ascertain additional inferred rankings on the tested items. The Q-Sort inferred rankings are
used to create phantom MaxDiff tasks, MaxDiff tasks that weren’t actually asked to respondents,
but researchers can infer from other data what the respondents would have selected. The
phantom MaxDiff tasks are used to supplement the original MaxDiff tasks and thus create a
super-charged CHO file (or response file) for utility estimation. See Figure 1 for an overview of
the process.
Figure 1: MaxDiff Augmentation Process Overview
In our case study, there were 46 total messages tested via Sparse MaxDiff using 13 MaxDiff
questions, 4 items per screen, and 26 blocks. Following the MaxDiff section, respondents
completed a Q-Sort exercise.
Q-Sort can be done in a variety of ways. In this case, MaxDiff responses for “most” and
“least” were tracked via programming logic and entered into the two Q-Sort sections—one for
“most” items and one for “least” items. The first question in the Q-Sort section for “most” items
showed respondents the statements they selected as “most” throughout the MaxDiff screens and
asked them to choose their top four. The second question in the Q-Sort section for “most” items
asked respondents to choose the top one from the top four.
The Q-Sort section for “least” items mirrored the Q-Sort section for “most” items. The first
question in the Q-sort section for “least” items showed respondents the statements they selected
as “least” throughout the MaxDiff screens and asked them to choose their bottom four. The
second question in the Q-Sort section for “least” items asked respondents to choose the bottom
one from the bottom four. See Figure 2 for a summary of the Q-Sort questions.
Figure 2: Summary of Q-Sort Questions
From the Q-Sort section on “most” items, for each respondent researchers know: the best
item from Q-Sort, the second best items from Q-Sort (of which there are three), and the third tier
of best items from Q-Sort (the remaining “most” items not selected in Q-Sort section for “most”
items). And from the Q-Sort section on “least” items, for each respondent researchers also know:
the worst item from Q-Sort, the second to worst items from Q-Sort (of which there are three),
and the third tier of worst items from Q-Sort (the remaining “least” items not selected in Q-Sort
section for “least” items). The inferred rankings from this data are custom for each respondent,
but at a high level we know:
 The best item from Q-Sort (1 item) > All other items
 The second best items from Q-Sort (3 items) > All other items except the best item from Q-Sort
 The worst item from Q-Sort (1 item) < All other items
 The second to worst items from Q-Sort (3 items) < All other items except the worst item from Q-Sort
Using these inferred rankings, supplemental phantom MaxDiff tasks are created. Although
respondents were not asked these questions, their answers can be inferred, assuming that
respondents would have answered the new questions consistently with the observed questions.
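The sketch below illustrates the inference step in generic form. It is not the authors' 18-task layout (that is shown in Figure 3), and it assumes each phantom task is built so that exactly one item comes from a strictly better tier and one from a strictly worse tier than the rest; names and tier numbers are illustrative.

```python
# Generic sketch (not the authors' exact task layout) of inferring the "most"
# and "least" responses for a phantom MaxDiff task from Q-Sort tiers, where a
# lower tier number means a better Q-Sort tier. Tasks are assumed to be built
# so that the top and bottom tiers within each task are unambiguous.

def infer_phantom_response(task_items, tier_of):
    """Return the inferred (most, least) pair for one phantom task."""
    most = min(task_items, key=lambda item: tier_of[item])
    least = max(task_items, key=lambda item: tier_of[item])
    return most, least

# Hypothetical tiers for one respondent: 1 = Q-Sort best, 2 = second-best three,
# 5 = second-worst three, 6 = Q-Sort worst; other items sit in between.
tier_of = {"A": 1, "B": 2, "C": 4, "D": 5, "E": 6}
print(infer_phantom_response(["A", "C", "D", "E"], tier_of))   # ('A', 'E')
```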
Since the MaxDiff selections vary by respondent, the Q-Sort questions, the inferred rankings
from Q-Sort, and finally the phantom MaxDiff tasks are also customized to each respondent.
From respondents’ Q-Sort answers, we created 18 supplemental phantom MaxDiff tasks as
well as the inferred “most” (noted with “M”) and “least” (noted with “L”) responses. See Figure
3 for the phantom MaxDiff tasks we created.
Figure 3: Supplemental phantom MaxDiff tasks using both the Q-Sort section on “most”
items and the Q-Sort section on “least” items
Respondents’ Q-Sort answers are matched to the supplemental Q-Sort-Based MaxDiff tasks
(i.e., the phantom MaxDiff tasks) to produce a new design file that merges both the original
MaxDiff design and the Q-Sort supplements. Figure 4 illustrates the process in detail.
Figure 4: Generating a new design file that merges original MaxDiff with Q-Sort
supplements (i.e., phantom MaxDiff tasks)
Likewise respondents’ Q-Sort answers are used to infer their responses to the phantom
MaxDiff tasks to produce a new response file that merges both responses to the original MaxDiff
and the Q-Sort supplements. The new merged design and response files are combined in a
supercharged CHO file and used for utility estimation. Figure 5 provides an illustration.
Figure 5: Generating a new response file that merges original MaxDiff responses with
responses to Q-Sort supplements (i.e., phantom MaxDiff tasks)
EXPERIMENT
Our experiment sought to answer these questions:
1. Given the complex programming and additional questionnaire time, is Augmented
MaxDiff worth doing or is Sparse MaxDiff doing a sufficient job?
2. If augmentation is valuable, how much is needed?
3. Could augmentation be done using only “best” items or should “worst” items also be
included?
To answer these questions, the authors compared the model fit of Sparse MaxDiff with
Augmented MaxDiff when two types of Q-Sort augmentations are done:
 Augmentation of best (or top) items only.
 Augmentation of both best and worst (or top and bottom) items.
Recall that we generated supplemental phantom MaxDiff tasks using both the Q-Sort section
for the “most” items and the Q-Sort section for the “least” items (see Figure 3). When testing
augmentation including only “best” items we created supplemental phantom MaxDiff tasks using
only the Q-Sort section for the “most” items as shown in Figure 6. Again, responses for “most”
(noted with “M”) and “least” (noted with “L”) for each phantom tasks can be inferred.
Figure 6: Supplemental phantom MaxDiff tasks using only the Q-Sort section on “most”
items.
We also evaluated the impact of degree of augmentation on model fit by examining MaxDiff
Augmentation including 3, 5, 7, 9, and 18 supplemental phantom MaxDiff tasks.
FINDINGS
As expected, heavier augmentation improves fit. Heavier augmentation appends more Q-Sort data, and Q-Sort data is presumably consistent with MaxDiff data. Thus heavier augmentation appends more consistent data, and we therefore expected overall respondent consistency
measurements to increase. Percent Certainty and RLH are both higher for heavy augmentation
compared to Sparse MaxDiff (i.e., no augmentation) and lighter augmentation as shown in
Figure 7.
Figure 7: Findings from our estimation experiment
Surprisingly Best-Only Augmentation outperforms Best-Worst Augmentation even though
less information is used with Best-Only Augmentation (Best-Only % Cert=0.85 and RLH=0.81;
Best-Worst % Cert=0.80 and RLH=0.76).
To further understand this unexpected finding, we did a three-way comparison of (1) Sparse
MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only
Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst Augmentation
(“Q-Sort Best & Worst”). The results showed that at the aggregate level, the story is the same
regardless of whether augmentation is used. Spearman’s Rank Order correlation showed strong,
positive, and statistically significant correlations between the three indexed MaxDiff scores with
rs(46)>=0.986, p=.000 (see Figure 8). Note that for the two cases when augmentation was
employed for this test, we used 18 supplemental phantom MaxDiff tasks.
Figure 8: Spearman's Rank Order Correlation for indexed MaxDiff scores from (1) Sparse
MaxDiff without Augmentation (“No Q-Sort”) versus (2) Sparse MaxDiff with Best-Only
Augmentation (“Q-Sort Best Only”) versus (3) Sparse MaxDiff with Best-Worst
Augmentation (“Q-Sort Best & Worst”)
We further compared matches between (1) the top and bottom items based on MaxDiff scores
versus (2) top and bottom items based on Q-Sort selections. An individual-level comparison was
used to show the percent of times there is a match between Q-Sort top four items and MaxDiff
top four items as well as between Q-Sort bottom four items and MaxDiff bottom four items.
We found that at the respondent level the model is imprecise without augmentation. In
particular, the model is less precise at the “best” end compared to the “worst” end (45% match
on “best” items vs. 51% match on “worst”). In other words, researchers can be more sure about
MaxDiff items that come out at the bottom as compared to MaxDiff items that come out at the
top. This finding was consistent with results from a recent ART forum paper by Dyachenko,
Naylor, & Allenby (2013). The implication for MaxDiff Augmentation is that augmenting on best
items is critical due to the lower precision around those items.
CONCLUSIONS
As expected, at the respondent level Sparse MaxDiff is less precise than Sparse
MaxDiff with augmentation. However, at the aggregate level the results of Sparse MaxDiff are
similar to results with augmentation. Therefore, for studies where aggregate level results are
sufficient, Sparse MaxDiff is suitable. But for studies where stability around individual-level
estimates is needed, augmenting Sparse MaxDiff is recommended.
By augmenting on “best” items only, researchers get a better return from a shorter
questionnaire and less complex programming than when augmenting on both “best” and
“worst” items. MaxDiff results were shown to be less accurate at the “best” end, and
augmentation on “best” items improved fit.
Heavy augmentation, whether on “best” items only or on both “best” and “worst” items, is critical
when other analyses (e.g., TURF, clustering) are required. The accuracy of utilities estimated
from heavy augmentation was superior to that of utilities estimated from lighter augmentation.
Finally, if questionnaire real estate allows, obtain additional information from which the
augmentation can benefit. For example, in the Q-Sort exercise, instead of asking respondents to
select the top/bottom four and then the single top/bottom item, ask respondents to rank the
top/bottom four. This additional ranking information allows more flexibility in creating the item
combinations for the supplemental phantom MaxDiff tasks and, we hypothesize, yields better
utility estimates.
Urszula Jones
Jing Yeh
REFERENCES
Dyachenko, Tatiana; Naylor, Rebecca; & Allenby, Greg (2013), “Models of Sequential
Evaluation in Best-Worst Choice Tasks,” 2013 Sawtooth Software Conference Proceedings.
Tang, Jane; Grenville, Andrew (2012), “How Many Questions Should You Ask in CBC
Studies?—Revisited Again,” 2010 Sawtooth Software Conference Proceedings, 217–232.
Wirth, Ralph; Wolfrath, Anette (2012), “Using MaxDiff for Evaluating Very Large Sets of Items:
Introduction and Simulation-Based Analysis of a New Approach,” 2012 Sawtooth Software
Conference Proceedings, 99–109.
WHEN U = βX IS NOT ENOUGH:
MODELING DIMINISHING RETURNS AMONG CORRELATED
CONJOINT ATTRIBUTES
KEVIN LATTERY
MARITZ RESEARCH
1. INTRODUCTION
The utility of a conjoint alternative is typically assumed to be a simple sum of the betas for
each attribute. Formally, we define utility U = βx. This assumption is very reasonable and robust.
But there are cases where U = βx is not enough; it is simply too simple. One example, and the
focus of this paper, is when the utilities of attributes in a conjoint study are correlated.
Correlated attributes can arise in many ways, but one of the most prevalent ways is with
binary yes/no attributes. An example of binary attributes appears below where the attributes
marked with an x mean that the benefit is being offered, and a blank means it is not.
[Example task with three programs. Each program shows two non-binary attributes
(Non-Binary Attribute 1 and Non-Binary Attribute 2, at Levels 1–3) plus a subset of five binary
benefits: Discounts on equipment purchases; Access to online equipment reviews by other
members; Early access to new equipment (information, trial, purchase, etc.); Custom fittings;
Members-only logo balls, tees, tags, etc. Program 1 offers the most of these benefits and
Program 3 the fewest.]
If the binary attributes are correlated, then adding more benefits does not give a consistent or
steady lift to utility. The result is that the standard U = βx model tends to over-predict interest
when there are more binary features (Program 1 above) and under-predict product concepts that
have very few of the features (Program 3 above). This can be a critical issue when running
simulations.
One common marketing issue is whether a lot of smaller cheaper benefits can compensate for
more significant deficiencies. Clients will simulate a baseline first product, and then a second
cheaper product that cuts back on substantial features. They will then see if they can compensate
for the second product’s shortcomings by adding lots of smaller benefits. We have seen many
cases where simulations using the standard model suggest a client can improve a baseline
product by offering a very inferior product with a bunch of smaller benefits. And even though
this is good news to the client who would love this to be true, it seems highly dubious even to
them.
The chart below shows the results of one research study. The x-axis is the relative number of
binary benefits shown in an alternative versus the other alternative in its task. This conjoint study
showed two alternatives per task, each with only 1 to 3 binary benefits, so we simply take the
difference in the number of benefits shown. For example, showing alternative A with 1 benefit
and alternative B with 3 benefits gives a difference of (1 - 3) = -2 for A and (3 - 1) = +2 for B. The
vertical axis is the corresponding error, Predicted Share minus Observed Share, for the in-sample tasks.
This kind of systematic bias, where we overstate the share as we add more benefits, is
extremely problematic when it occurs. While this paper focuses on binary attributes, the problem
occurs in other contexts, and the new model we propose can be used in those as well. This paper
also discusses some of the issues involved when we change the standard conjoint model to
something other than U = βx. In particular, we discuss how it is not enough to simply change the
functional form when running HB. Changing the utility function also requires other significant
changes in how we estimate parameters.
2. LEARNING FROM CORRELATED ALTERNATIVES
There is not much discussion in the conjoint literature about correlated attributes, but there
is a large body of work on correlated alternatives. Formally, we speak of the “Independence of
Irrelevant Alternatives” (IIA) assumption, under which the basic logit model assumes the alternatives
are not correlated. The familiar blue bus/red bus case is an example of correlated
alternatives. Introducing a new bus that is the same as the old bus except for its color creates
correlation between the two alternatives: the red bus and the blue bus are correlated. The result is parallel
to what we described above for correlated attributes: the model over-predicts how many people
ride the bus. In fact, if we keep adding the same bus with different colors—yellow, green,
fuchsia, etc.—the basic logit model will eventually predict that nearly everyone rides the bus.
One of the classic solutions to correlated alternatives is Nested Logit. With Nested Logit,
alternatives within a nest are correlated with one another. In the case of buses, one would group
the various colors of buses together into one nest of correlated alternatives. The degree of
correlation is specified by an additional λ parameter. When λ=1, there is no correlation among the
alternatives in a nest, and as λ goes to 0, the correlation increases. One solves for λ, given the
specific data. In the case of red bus/blue bus, we expect λ near 0.
The general model we propose follows the mathematical structure of nested logit, but instead
of applying the mathematical structure to alternatives, we apply it to attributes. This means we
think of the attributes as being in nests, grouping together those attributes that are correlated, and
measuring the degree of that correlation by introducing an additional λ parameter. In our case, we
are working at the respondent level, and each respondent will have their own λ value. For some
respondents, a nest of attributes may have little correlation, while for others the attributes may be
highly correlated.
Recall that the general formulation for a nested logit, where the nest is composed of
alternatives 1 through n with non-exponentiated utilities of U1 . . . Un is:
e^U = [(e^U1)^(1/λ) + (e^U2)^(1/λ) + ... + (e^Un)^(1/λ)]^λ
where 0 < λ <= 1, and 1 - λ can be interpreted as the correlation among the alternatives. U gives
the overall utility of the nest as a whole, which is then used in the multinomial logit
in the usual way. We can adapt this formulation to create a nest of attributes in the following way.
Consider a defined nest of n attributes with betas B1 . . . Bn, and corresponding indicators x1 . . .
xn (each indicator being 1 if the attribute applies to an alternative, and zero otherwise). Then the
standard utility for the group of attributes is:
U = [(x1B1) + (x2B2) + ... + (xnBn)]
and the new nested-like formulation is:
U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ
Each set of attributes that is grouped together in a nest would have its own λ parameter. For
instance, we might group attributes 1–5 together in one nest, and attributes 6–9 in another. We
would then compute the new utility U for each nest separately and add them. We could even
employ hierarchies of nests as we do with nested logit. One important caveat is that each Bi>=0.
We are talking about diminishing returns, so we are assuming each beta is positive, but the return
may diminish when it is with other attributes in the same nest. (If the betas were not all positive
the idea of diminishing returns would not make any sense.)
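To make the behavior of the formulation concrete, here is a minimal sketch in Python (the function name and example betas are illustrative, not from the study):

```python
import numpy as np

def nested_utility(betas, x, lam):
    """U = [sum_i (x_i * B_i)^(1/lam)]^lam for one nest, with each B_i >= 0 and 0 < lam <= 1."""
    terms = (np.asarray(x, float) * np.asarray(betas, float)) ** (1.0 / lam)
    return terms.sum() ** lam

betas = [2.0, 1.2, 0.8]   # illustrative positive betas for three binary benefits
x = [1, 1, 1]             # all three benefits present in the alternative

print(nested_utility(betas, x, lam=1.0))   # 4.0: reduces to the standard additive utility
print(nested_utility(betas, x, lam=0.2))   # ~2.03: shrunk toward the single largest beta (2.0)
```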
This new formulation has several excellent properties, which it shares with nested logit.
1) When λ =1, the formulation reduces to the standard utility. This is also its maximum
value.
2) As λ shrinks towards 0, the total utility also shrinks. At the limit of λ =0, the total utility
is just the utility of the attribute in the nest with the greatest value.
3) The range of utility values is from the utility of the single best attribute to the simple sum
of the attributes. We are shrinking the total utility from the simple sum down to an
amount that is at least the single highest attribute.
4) Adding a new attribute (changing xi from 0 to 1) will always add some amount of utility
(though it could be very close to 0). It is not possible to have reversals, where adding an
item shrinks the total utility down lower than it was before adding the item.
5) The amount of shrinkage depends upon the size of the betas and their relative difference.
Betas that are half as large will shrink half as much. In addition, the amount of shrinkage
also depends upon their relative size. Three nearly equal betas will shrink differently than
3 where one is much smaller.
Before turning to the results using this nested attribute formulation, I want to briefly describe
and evaluate some other methods that have been proposed to deal with the problem of
diminishing returns among correlated attributes.
3. ALTERNATIVE SOLUTIONS TO DIMINISHING RETURNS
There are three alternative methods I will discuss here, which other practitioners have shared
with me in their efforts to solve the problem of diminishing returns. Of course, there are likely
other methods as well. The advantage of all the alternative methods discussed here is that they do
not require one to change the underlying functional form, so they can be easily implemented,
with some caveats.
3.1 Recoding into One Attribute with Many Levels
One method for modeling a set of binary attributes is to code them into one attribute with
many levels. For example, 3 binary attributes A, B, C can be coded into one 7-level attribute: A,
B, C, AB, AC, BC, ABC (or an 8-level, if none is included as a level). This gives complete
control over how much value each combination of benefits has. To implement it, you would also
need to add six constraints:
– AB > A
– AB > B
– AC > A
– AC > C
– ABC > AB
– ABC > AC
With only 2–3 attributes, this method works fine. It adds quite a few more parameters to
estimate, but is very doable. When we get to 4 binary attributes, we need to make a 15- or
16-level attribute with 28 constraints. That is more levels and constraints than most will feel
comfortable including in the model. With 5 attributes, things are definitely out of hand, requiring
an attribute with 31 or 32 levels and many constraints.
While this method gives one complete control over every possible combination, it is not very
parsimonious, even with 4 binary attributes. My recommendation is to use this approach only
when you have 2–3 binary attributes.
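For those who do use this recoding, the levels and the monotonicity constraints can be generated automatically. Here is a minimal sketch (names are illustrative) that enumerates every non-empty combination and constrains each level against the levels formed by dropping one benefit; this enumeration yields 28 constraint pairs for four binary attributes:

```python
from itertools import combinations

def recode_binaries(benefits):
    """Return the composite-attribute levels (all non-empty combinations of the binary
    benefits) and monotonicity constraints: each level must be worth at least as much
    as the level obtained by dropping any single benefit from it."""
    levels = [frozenset(c) for r in range(1, len(benefits) + 1)
              for c in combinations(benefits, r)]
    constraints = [(lvl, lvl - {b}) for lvl in levels if len(lvl) > 1 for b in lvl]
    return levels, constraints

levels, constraints = recode_binaries(["A", "B", "C", "D"])
print(len(levels), len(constraints))   # 15 levels, 28 constraint pairs
```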
3.2 Adding Interaction Effects
A second approach is to add interaction terms. For example, if one is combining attribute A
and B, the resulting utility is A + B - AB, where the last term is an interaction value that
represents the diminishing returns. When AB is 0, there is no diminishing.
It is not clear how this works with more attributes. With 3 binary attributes A, B, C, the total
utility might be A + B + C - (AB + AC + BC + ABC). But notice that we have lots of interactions
to subtract. When we get to 4 attributes, it is even more problematic, as we have 11 interactions
to subtract out. If we use all possible interactions, we wind up with one degree of freedom per
level, just as in the many-leveled attribute approach, so this method is like that of 3.1 in terms of
number of parameters. One might reduce the number of parameters by only subtracting the
pairwise interactions, imposing a little more structure on the problem. Even pairs alone can still
be quite a few interactions, however.
The bigger problem with this method is how to make the constraints work. We must be
careful not to subtract out too much for any specific combination. Otherwise, it will appear that
we add a benefit, and actually reduce the utility, a “reversal.” This means trying to set up very
complex constraints. So we need constraints like (A + B + C) > (AB + AC + BC + ABC). Each
interaction term will be involved in many such constraints. Such a complex constraint structure
will most likely have a negative impact on estimation. I see no value in this approach, and
recommend the previous recoding option in 3.1 as much easier to implement.
3.3 Adding One or More Variables for Number of Items
The basic idea of this approach is like that of subtracting interactions, but using a general
term for the interactions based on the number of binary items, rather than estimating each
interaction separately. So for two attributes A and B, the total utility would be A + B - k, and for
three attributes it might be A + B + C - 2k.
In this case, k represents the amount we are subtracting for each additional benefit. With 5
benefits, we would be subtracting 4k. The modeling involves adding an additional variable to the
design code. This new variable counts the number of binary attributes in the alternative. So an
alternative with 5 binary attributes has a new variable with a value of 4 (we use number of
attributes - 1). The value k can be estimated just like any other beta in the model, constrained for
the proper sign.
One can even generalize the approach by making it non-linear, perhaps adding a squared
term, for instance. Or we can make the amount subtracted vary with the number of benefits. So
for two attributes the total utility would be A + B - m, and for three attributes it might be
A + B + C - n.
In this very general case, we model a specific value for each number of binary attributes,
with constraints applied so that we subtract out more as the number of attributes increases (so n >
m above). The number of additional parameters to estimate is the range of the number of
attributes in an alternative (minus 1). So if the design shows alternatives ranging from 1 binary
benefit to 9, we would estimate 8 additional parameters, ordinally constrained.
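A minimal sketch of the design coding this approach requires (the layout is illustrative): append one column equal to the number of binary benefits shown minus one, whose coefficient k is then estimated like any other beta; the more general variant would instead append ordinally constrained dummies, one per possible count.

```python
import numpy as np

def add_item_count_column(design, binary_cols):
    """design: (n_rows, n_vars) coded design matrix, one row per alternative.
    binary_cols: column indices of the binary benefits (1 = shown, 0 = not).
    Returns the design with an extra column holding (count of binary benefits - 1)."""
    design = np.asarray(design, dtype=float)
    count = design[:, binary_cols].sum(axis=1) - 1.0
    return np.column_stack([design, count])
```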
I think this approach is very clever, and in some cases it may work. The general problem I
have with it is that it can create reversals. Take the simple case of two binary attributes A + B - k.
In that case, k should be smaller than A or B. The same applies to different pairs of attributes. In
the case of C + D - k, we want k to also be less than C or D. In general we want k to be less than
the minimum value of all the binary attributes. Things get even more complicated when we turn
to 3 attributes. In the end, I’m not certain these constraints can be resolved. One may just have to
accept reversals. In the results section, I will show the degree to which reversals occurred using
this method in my data sets. This is not to say that the method would never be an improvement
over the base model, but one should be forewarned about the possibility of reversals.
Another disadvantage of this method is that shrinkage is purely a function of the number of
items. Given 3 items, one will always subtract out the same value regardless of what those items
are. In contrast, the nested logit shrinks the total utility based on the actual betas. So with the
nested formulation, 3 items with low utility are shrunk less than 3 items with high utility, in
proportion to their scale, based on their relative size to each other, and the shrinkage cannot
exceed the original effects. Basing shrinkage on the actual betas is most likely better than
shrinkage based solely on the number of items. That said, making shrinkage a function of the
betas does impose more difficulty in estimation—a topic we turn to in the next section.
4. ESTIMATION OF NESTED ATTRIBUTE FORMULATION (LATENT CLASS ENSEMBLES AND
HB)
Recall that the proposed formula for a nest of attributes 1 through n, with corresponding
betas Bi and indicators xi is:
U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ, where λ ∈ (0,1] and Bi >= 0
In the results section, we estimated this model using Latent Class Ensembles. This method
was developed by Kevin Lattery and presented in detail in an AMA ART Forum 2013 paper,
“Improving Latent Class Conjoint Predictions With Nearly Optimal Cluster Ensembles.” As we
mention there, the algorithm is run using custom SAS IML code. We give a brief description of
the method below.
Latent Class Ensembles is an extension of Latent Class. But rather than one Latent Class
solution, we develop many Latent Class solutions. This is easy to do because Latent Class is
subject to local optima. By using many different starting points and relaxing the convergence
criteria (max LL/min LL > .999 for last 10 iterations), we create many different latent class
solutions. These different solutions form an ensemble. Each member of the ensemble (i.e., each
specific latent class solution) gives a prediction for each respondent. We average over those
predictions (across the ensemble) to get the final prediction for each respondent. So if we
generate 30 different Latent Class solutions, we get 30 different predictions for a specific
respondent, and we average across those 30 predictions. This method significantly improves the
predictive power versus a single Latent Class solution.
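The averaging step itself is straightforward; here is a minimal sketch (fit_latent_class is a hypothetical stand-in for the authors’ SAS IML routine, not a real function):

```python
import numpy as np

def ensemble_predict(fit_latent_class, data, n_members=30, seed=0):
    """Run latent class from many random starting points (with relaxed convergence)
    and average the respondent-level choice probabilities across ensemble members.
    fit_latent_class(data, seed) -> (n_respondents, n_alternatives) is hypothetical."""
    rng = np.random.default_rng(seed)
    preds = [fit_latent_class(data, seed=int(s))
             for s in rng.integers(0, 2**31 - 1, size=n_members)]
    return np.mean(np.stack(preds), axis=0)
```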
The primary reason for using Latent Class Ensembles is that the model was easier to
estimate. One of the problems with nested logit is that it is subject to local maxima. So it is
common to estimate nested logits in three steps:
1) Assume λ =1, and estimate the betas,
2) Keep betas fixed from 1), and estimate λ,
3) Using betas and λ from 2) as starting points to estimate both simultaneously.
We employed this same three-step process in the latent class ensemble approach. So each
iteration of the latent class algorithm required three regressions for one segment, rather than one.
But other than that, we simply needed to change the function and do logistic regression. One
additional wrinkle is that we estimated 1/λ rather than λ. We then set the constraint on 1/λ to
range from 1 to 5. Using the inverse broadens the range of what is being estimated and makes it
easier to estimate than the equivalent λ ranging from .2 to 1. In some cases I have capped 1/λ at
10 rather than 5. If one allows 1/λ to get too large, it can cause calculation overflow errors. At .2,
the level of shrinkage is quite close to what one would find at the limit of 0.
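For readers who want to experiment outside SAS, here is a minimal aggregate-level sketch of the same three-step sequence in Python (it fits one pooled multinomial logit rather than latent class ensembles, and all array layouts and names are illustrative), using the 1/λ re-parameterization capped at 5:

```python
import numpy as np
from scipy.optimize import minimize

def nest_u(Xb, bb, inv_lam):
    """Nested utility of the binary-attribute nest for each alternative."""
    return ((Xb * bb) ** inv_lam).sum(axis=-1) ** (1.0 / inv_lam)

def neg_ll(bo, bb, inv_lam, Xo, Xb, choice):
    """MNL negative log-likelihood. Xo: (tasks, alts, k_other) non-binary codes;
    Xb: (tasks, alts, k_bin) binary indicators; choice: chosen alternative per task."""
    u = Xo @ bo + nest_u(Xb, bb, inv_lam)
    u -= u.max(axis=1, keepdims=True)                       # numerical stability
    p = np.exp(u) / np.exp(u).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(choice)), choice]).sum()

def fit_three_step(Xo, Xb, choice):
    ko, kb = Xo.shape[-1], Xb.shape[-1]
    # Step 1: assume lambda = 1 (inv_lam = 1) and estimate the betas.
    f1 = lambda t: neg_ll(t[:ko], t[ko:], 1.0, Xo, Xb, choice)
    r1 = minimize(f1, np.zeros(ko + kb),
                  bounds=[(None, None)] * ko + [(0, None)] * kb)
    # Step 2: hold the betas fixed and estimate 1/lambda on [1, 5].
    f2 = lambda t: neg_ll(r1.x[:ko], r1.x[ko:], t[0], Xo, Xb, choice)
    r2 = minimize(f2, np.array([1.0]), bounds=[(1.0, 5.0)])
    # Step 3: estimate betas and 1/lambda simultaneously from those starting points.
    f3 = lambda t: neg_ll(t[:ko], t[ko:ko + kb], t[-1], Xo, Xb, choice)
    r3 = minimize(f3, np.concatenate([r1.x, r2.x]),
                  bounds=[(None, None)] * ko + [(0, None)] * kb + [(1.0, 5.0)])
    return r3.x
```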
Our intent was to also use the nested formulation within HB. That however proved to be
more difficult. Simply changing the functional form and constraining 1/λ did not work at all. HB
did not converge, and any result we tried performed more poorly than the traditional non-nested
formula. We tried estimation first using our Maritz specific HB package in SAS and then also
using the R package ChoiceModelR. Both of these had to be modified for the custom function.
One problem is that λ is not a typical parameter like the betas. Instead, λ is something we
apply to the betas in exponential form. So within HB, λ should not be included in the covariance
matrix of betas. In addition, we should draw from a separate distribution for λ, including a
separate jumping factor in the Gibbs sampler. This then raises the question of order of estimation,
especially given the local optima that nested functions often have. My recommendation is as
follows:
1) Estimate the betas without λ (assume λ =1)
2) Estimate 1/λ using its own 1-attribute variance matrix, assuming fixed betas from above
3) Estimate betas and 1/λ starting with covariance matrix of betas in 1) and 1/λ covariance
matrix in 2)
We have not actually programmed this estimation method, but offer it as a suggestion for
those who would like to use HB. It is clear to us that one needs, at the very least, separate draws
and covariance matrices for betas and 1/λ. The three steps above recognize that and adopt the
conservative null hypothesis that λ will be 1, moving away from 1 only if the data support that.
The three steps parallel the three steps taken in the Latent Class ensembles, and the procedure of
nested logit more generally: sequential, then simultaneous. My recommendation would be to first
do step 1 across many MCMC iterations until convergence, estimating betas only for several
thousand iterations. Then do step 2 across many MCMC iterations, estimating λ. Finally, do step
3, which differs from the first two steps in that each iteration estimates new betas and then new
λs.
5. EMPIRICAL RESULTS
Earlier we showed the results for the standard utility HB model. When we apply the nested
attribute utility function (estimated with Latent Class ensembles), we get the following result. We
show the new nested method just to the right of the standard method that was shown in the
original figure:
[Chart: in-sample share error by the relative number of binary benefits, shown side by side for
the Standard Utility model and the Nested Model.]
The nested formulation not only shows a reduction in error, but removes most of the number
of item bias. The error is now almost evenly distributed across the relative number of items.
The data set above is our strongest case study of the presence of item bias. We believe there are a
few reasons for this, which make this data set somewhat unique. First, this study only measured
binary attributes. Most conjoint studies have a mixture of binary and non-binary attributes. When
there is error in the non-binary attributes, the number of attributes bias is less apparent.
A second reason is that this study showed exactly two alternatives. This makes it easy to plot
the difference in the number of items in each task. If alternative A showed 4 binary attributes and
alternative B showed 3 binary attributes, the difference is +1 for A and -1 for B. In most of our
studies we have more than 2 alternatives. So the relative number of binary attributes is more
complicated. The next case study had a more complex design, with 3 alternatives and both binary
and non-binary attributes. Here is a sample task:
Attribute                   Loyalty Program 1        Loyalty Program 2      Loyalty Program 3
Points Earned               100 pts per $100         200 pts per $100       400 pts per $100
Room Upgrades               None                     2 per year             Unlimited
Food & Beverage Discount    None                     10% Off                15% Off
Frequent Flyer Miles        1,500 miles per stay     500 per stay           1,000 miles per stay
Additional Benefits         5 benefits shown         2 benefits shown       1 benefit shown
(The additional benefits shown across the three programs include: early check-in and extended
check-out; two one-time guest passes to an airport lounge; complimentary welcome gift/snack;
turndown service; gift shop discount; priority check-in and express check-out; complimentary
breakfast; and no blackout dates for reward nights.)
In this case, we estimated the relative number of binary attributes by subtracting from the
mean. The mean number of binary attributes in the example above is (5+2+1)/3 = 2.67. We then
subtract the mean from the number of binary attributes in each alternative. So alternative 1 has
5 - 2.67 = 2.33 more binary attributes than average. Alternative 2 has 2 - 2.67 = -.67, and
alternative 3 has 1 - 2.67 = -1.67. Note that these calculations are not used in the modeling; we
are only doing them to facilitate plotting the bias in the results.
The chart below shows the results for each in-sample task. The correlation is .33, much
higher than it should be; ideally, there should be no relationship here.
The slope of the line is 2.4%, which means that each time we add an attribute we are
overstating share by an average of 2.4%. This assumes a 3-alternative scenario. With two
alternatives that slope would likely be higher. Clearly there is systematic bias. It is not as clean as
the first example, in part because we have noise from other attributes as well. Moreover, we
should note that this is in-sample share. Ideally, one would like out-of-sample tasks, which we
expect will show even more bias than above. If you see the bias in-sample it is even more likely
to be found out-of-sample.
Using the new model, the slope of the line relating relative number of attributes to error in
share is only 0.9%. This is much closer to the desired value of 0. We also substantially reduced
the mean absolute error, from 5.8% to 2.6%.
The total sample size for this case study was 1,300. 24.7% of those respondents had a λ of 1,
meaning no diminishing returns for them. 34.7% had a λ of .2, the smallest used in this model.
The remaining 40.7% had a median of .41, so the λ values skewed lower. The chart below shows
the cumulative percentage of λ values, flipped at the median. Clearly there were many λ values
significantly different from 1.
Cumulative Distribution of Estimated λ Values
(Flipped at the Median; So Cumulative from Each End to the Middle)
We also fit a model using the method discussed in section 3.3, adding a variable equal to the
number of binary attributes minus 1. This offered a much smaller improvement. The slope of the
line bias was reduced from 2.4% to 1.8%. So if one is looking for a simple approach that may
offer some help in reducing systematic error, the section 3.3 approach may be worth considering.
However, we remain skeptical of it because of the problem of reversals, which we discuss below.
One drawback to the number of binary attributes method is reversals. These occur when the
constant we are subtracting is greater than the beta for an attribute. Properly fixing these
reversals is extremely difficult. One attribute, Gift Shop Discount, showed a reversal at the
aggregate level. On its own, it added benefit, but when you added it to any other existing binary
attribute, the predicted result was lower. Clearly, this would be counterintuitive and would need
to be fixed in such a model.
It turns out that every one of the 24 binary attributes had a reversal for some respondents,
using the method in 3.3. In addition to “Gift Shop Discounts,” two other attributes had reversals
for over 40% of the respondents. This is clearly an undesirable level of reversals, and could show
up as reversals in aggregate calculations for some subgroups. For this reason, we remain
skeptical of using this method. The nested logit formulation never produces reversals.
6. NOTES ON THE DESIGN OF BINARY ATTRIBUTES
While the focus of this paper has been on the modeling of binary attributes, there are a few
crucial comments I need to make about the experimental design with sets of binary attributes.
After presenting initial versions of this paper, one person exclaimed that given a set of say 8
binary attributes, they like to show the same number of binary attributes in each alternative and
task. For example, the respondent will always see 3 binary attributes in each alternative and task.
This makes the screen look perfectly square, and gives the respondent a consistent experience.
Of course, it also means you won’t notice any number of attribute bias. But it doesn’t mean the
bias is not there; we are just in denial.
In addition to avoiding (but not solving) the issue, showing the same number of binary
attributes is typically an absolutely terrible idea. If one always shows 3 binary attributes, then
your model can only predict what happens if each alternative has the same number of binary
attributes. The design is inadequate to make any other predictions. You really don’t know
anything about 2 binary vs. 3 binary vs. 4 binary, etc.
To explain this point, consider the following three-alternative task, with the betas shown in the
cells:
                        Program 1        Program 2        Program 3
Non-Binary1             1.0              -3.0             2.0
Non-Binary2             1.5              1.5              -0.5
Binary betas shown      2.0, 0.5, 0.2    2.0, 1.2, 0.8    0.5, 0.2, 1.2
Utility                 5.2              2.5              3.4
Exponentiated U         181.3            12.2             30.0
Probability             81.1%            5.5%             13.4%
(Each alternative shows three of the five binary attributes, Binary 1 through Binary 5; the betas
of the three shown are listed.)
Now what happens if we add 2 to each binary attribute? We get the values below:
                        Program 1        Program 2        Program 3
Non-Binary1             1.0              -3.0             2.0
Non-Binary2             1.5              1.5              -0.5
Binary betas shown      4.0, 2.5, 2.2    4.0, 3.2, 2.8    2.5, 2.2, 3.2
Utility                 11.2             8.5              9.4
Exponentiated U         73,130           4,915            12,088
Probability             81.1%            5.5%             13.4%
The final predictions are identical! In fact, we can add any constant value to each binary
attribute and get the same prediction. So given our design, each binary attribute is really Beta +
k, where k can be anything. In general this is really bad because the choice of k makes a big
difference when you vary the number of attributes. Comparing a 2-item alternative with a 5-item
alternative results in a 3k difference in utilities, which leads to a big difference in estimated
shares or probabilities. The value of k matters in the simulations, but the design won’t let us
estimate it! The only time this design is not problematic is when you keep the number of binary
attributes the same in every alternative for your simulations as well as in the design.
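The invariance is easy to verify numerically with the utilities from the two tables above; a minimal sketch:

```python
import numpy as np

def shares(utilities):
    e = np.exp(utilities - utilities.max())      # logit shares, numerically stable
    return e / e.sum()

base = np.array([5.2, 2.5, 3.4])                 # total utilities from the first table
shifted = base + 3 * 2.0                         # add k = 2 to each of the 3 binary betas shown
print(np.round(shares(base), 3))                 # [0.811 0.055 0.134]
print(np.round(shares(shifted), 3))              # identical shares
```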
To do any modeling on how a mixed number of binary attributes works, the design must also
have a mixed number of binary attributes. This is true entirely independently of whether we’re
trying to account for diminishing returns. In general, we recommend you have as much
variability in your design as you want to simulate. Sometimes this is not reasonable, but it should
always be the goal.
One of the more common mistakes conjoint designers make is that they will let each binary
attribute be randomly present or absent. So if one has 8 binary attributes, most of the alternatives
have about 4 binary attributes. Very few, if any tasks, will show 1 binary attribute vs. 8 binary
attributes. But it is important that the design show this range of 1 vs. 8 if we are to accurately
model this extreme difference. My recommendation is to create designs where the range of
binary attributes varies from extreme to small, and evenly distribute tasks across this range. For
example, each respondent might see 3 tasks with extreme differences, 3 with somewhat extreme,
3 moderate, and 3 with minimal differences. The key is to get different levels of contrast in the
relative number of binary attributes if one wants to detect and model those kinds of differences.
7. CONCLUSION
The correlation of effects among attributes in a conjoint study means that our standard U =
βx may not be adequate. One way to deal with that correlation is to borrow the formulation of
nested logit, which was meant to deal with correlated alternatives. More specifically, the utility
for a nest of attributes 1 . . . n was defined as:
U = [(x1B1)^(1/λ) + (x2B2)^(1/λ) + ... + (xnBn)^(1/λ)]^λ
Employing that formulation in HB has its challenges, as one should specify separate draws
and covariance matrices for the betas and λ. We recommended a three stage approach for HB
estimation:
1) Estimate betas without λ (assume λ =1)
2) Estimate 1/λ using its own 1-attribute variance matrix, assuming fixed betas from above
3) Estimate betas and 1/λ starting with covariance matrix of betas in 1) and 1/λ covariance
matrix in 2)
While we did not validate the nested logit formulation in HB, we did test a similar three step
approach using the methodology of latent class ensembles: sequential estimation of betas,
followed by estimation of λ and then a simultaneous estimation. The nested attribute model
estimated this way significantly reduced the overstatement of share that happens with the
standard model when adding correlated binary attributes.
In this paper we have only discussed grouping all of the binary attributes together in one nest.
But of course, that assumption is most likely too simplistic. In truth, some of the binary attributes
may be correlated with each other, while others are not. Our next steps are to work on ways to
determine which attributes belong together, and whether that might vary by respondent.
There are several possible ways one might group attributes together. Judgment is one
method, based on one’s conceptual understanding of which attributes belong together. In some
cases, our judgment may even structure the survey design itself.
Another possibility is empirical testing of different nesting models, much like the way one
tests different path diagrams in PLS or SEM. We also plan to test rating scales. By asking
respondents to rate the desirability of attributes we can use correlation matrices and variable
clustering to determine which attributes should be put into a nest with one another. As noted in
the paper, one might even create hierarchies of nests, as one does with nested logits.
We have just begun to develop the possibilities of nested correlated attributes, and welcome
further exploration.
Kevin Lattery
RESPONDENT HETEROGENEITY, VERSION EFFECTS OR SCALE?
A VARIANCE DECOMPOSITION OF HB UTILITIES
KEITH CHRZAN,
AARON HILL
SAWTOOTH SOFTWARE
INTRODUCTION
Common practice among applied marketing researchers is to analyze discrete choice
experiments using Hierarchical Bayesian multinomial logit (HB-MNL). HB-MNL analysis
produces a set of respondent-specific part-worth utilities which researchers hope reflect
heterogeneity of preferences among their samples of respondents. Unfortunately, two other
potential sources of heterogeneity, version effects and utility magnitude, could create preference-irrelevant differences in part-worth utilities among respondents. Using data from nine
commercial choice experiments (provided by six generous colleagues) and from a carefully
constructed data set using artificial respondents, we seek to quantify the relative contribution of
version effects and utility magnitude on heterogeneity of part-worth utilities. Any heterogeneity
left unexplained by these two extraneous sources may represent real differences in preferences
among respondents.
BACKGROUND
Anecdotal evidence and group discussions at previous Sawtooth Software conferences have
identified version effects and differences in utility magnitudes as sources of preference-irrelevant
heterogeneity among respondents.
Version effects occur when respondents who receive different versions or blocks of choice
questions end up with different utilities. One of the authors recalls stumbling upon this by
accident, using HB utilities in a segmentation study only to find that the needs-based segment a
respondent joined depended to a statistically significant extent on the version of the conjoint
experiment she received.
We repeated these analyses on the first five data sets made available to us. First we ran
cluster analysis on the HB utilities using the ensembles analysis in CCEA. Crosstabulating the
resulting cluster assignments by version numbers, we found significant χ² statistics in three of the
five data sets. Moreover, of the 89 utilities estimated in the five models, analysis of variance
identified significant F statistics for 31 of them, meaning they differed by version.
Respondents may have larger or smaller utilities as they answer more or less consistently or
as their individual utility model fits their response data more or less well. The first data set made
available to us produced a fairly typical finding: the respondent with the most extreme utilities
(as measured by their standard deviation) had part-worths 4.07 times larger than those for the
respondent with the least extreme utilities. As measures of respondent consistency, differences in
utility magnitude are one (but not the only) manifestation of the logit scale parameter (Ben-Akiva
and Lerman 1985). Recognizing that the response error quantified by the scale parameter can
affect utilities in a couple of different ways, we will refer to this effect as one of utility
magnitude rather than of scale.
With evidence of both version effects and utility magnitude effects, we undertook this
research to quantify how much of between-respondent differences in utilities they explain. If it
turns out that these explain a large portion of the differences we see in utilities across
respondents we will have to question how useful it is to have respondent-level utilities:
• If much of the heterogeneity we observe owes to version effects, perhaps we should keep
our experiments small enough (or our questionnaires long enough) to have just a single
version of the questionnaire; of course, this raises the question of which version would be
the best one to use.
• If differences in respondent consistency explain much of our heterogeneity, then perhaps
we should avoid HB models and their illusory view of preference heterogeneity.
If these two factors explain very small portions of observed utility heterogeneity, however, then
we can be more confident that our HB analyses are measuring preference heterogeneity.
VARIANCE DECOMPOSITION OF COMMERCIAL DATA SETS
In order to decompose respondent heterogeneity into its components we needed data sets with
more than a single version, but with enough respondents per version to give us statistical power
to detect version effects. Kevin Lattery (Maritz Research), Jane Tang (Vision Critical), Dick
McCullough (MACRO Consulting), Andrew Elder (Illuminas), and two anonymous contributors
generously provided disguised data sets that fit our requirements.
The nine studies differed in terms of their designs, the number of versions and of respondents
they contained:
Study   Design                    Number of versions   Total sample size
1       5 x 4^3 x 3^2             10                   810
2       23 items/14 quints        8                    1,000
3       15 items/10 quads         6                    2,002
4       12 items/9 quads          8                    1,624
5       47 items/24 sextuples     4                    450
6       90 items/46 sextuples     6                    527
7       8 x 4^2 x 3^4 x 2         2                    5,701
8       5^2 x 4^2 x 3^2           10                   148
9       4^2 x 3^5 x 2 + NONE      6                    1,454
For each study we ran the HB-MNL using CBC/HB software and using settings we expected
practitioners would use in commercial applications. For example, we ran a single HB analysis
across all versions, not a separate HB model for each version; estimating a separate covariance
matrix for each version could create a lot of additional noise in the utilities. Or again, we ran the
CBC/HB model without using version as a covariate: using the covariate could (and did)
exaggerate the importance of version effects in a way a typical commercial user would not.
With the utilities and version numbers in hand we next needed a measure of utility
magnitude. We tried four different measures and decided to use the standard deviation of a
respondent’s part-worth utilities. This measure had the highest correlations with the other
measures; all were highly correlated with one another, however, so the results reported below are
not sensitive to this decision.
For variance decomposition we considered a variety of methods but they all came back with
very similar results. In the end we opted to use analysis of covariance (ANCOVA). ANCOVA
quantifies the contribution of categorical (version) and continuous (utility magnitude) predictors
on a dependent variable (in this case a given part-worth utility). We ran the ANCOVA for all
utilities in a given study and then report the average contribution of version and utility magnitude
across the utilities as the result for the given study. The variance unexplained by either version
effect or utility magnitude may owe to respondent heterogeneity of preferences or to other
sources of heterogeneity not included in the analysis. Because of an overlap in the explained
variance from the two sources we ran an averaging-over-orderings ANCOVA to distribute the
overlapping variance.
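A minimal sketch of the decomposition for a single part-worth utility (column names are illustrative, and this is one way to implement the averaging-over-orderings idea: the shared variance is split by averaging the Type I sums of squares over both orders of entry):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

def decompose_utility(df):
    """df columns: 'utility' (one part-worth), 'version' (block number, treated as
    categorical), 'magnitude' (std. dev. of the respondent's part-worths).
    Returns the percent of variance attributed to version, magnitude, and residual."""
    total_ss = ((df["utility"] - df["utility"].mean()) ** 2).sum()
    ss = {"C(version)": [], "magnitude": []}
    for order in (["C(version)", "magnitude"], ["magnitude", "C(version)"]):
        fit = smf.ols("utility ~ " + " + ".join(order), data=df).fit()
        anova = sm.stats.anova_lm(fit, typ=1)                 # sequential (Type I) SS
        for term in ss:
            ss[term].append(anova.loc[term, "sum_sq"])
    pct = {term: 100 * np.mean(vals) / total_ss for term, vals in ss.items()}
    pct["residual"] = 100 - sum(pct.values())
    return pct
```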
So what happened? The following table relates results for the nine studies and for the average
across them.
Study   Design                    Number of    Total        Variance          Variance        Variance
                                  versions     sample size  (magnitude) (%)   (version) (%)   (residual) (%)
1       5 x 4^3 x 3^2             10           810          6.2               4.9             88.9
2       23 items/14 quints        8            1,000        8.0               1.1             90.9
3       15 items/10 quads         6            2,002        6.4               0.1             93.5
4       12 items/9 quads          8            1,624        10.3              1.1             88.6
5       47 items/24 sextuples     4            450          7.9               0.7             91.4
6       90 items/46 sextuples     6            527          17.0              1.2             81.8
7       8 x 4^2 x 3^4 x 2         2            5,701        7.3               0               92.7
8       5^2 x 4^2 x 3^2           10           148          27.8              3.3             68.9
9       4^2 x 3^5 x 2 + NONE      6            1,454        6.7               1.1             92.2
Mean                                                        10.8              1.4             87.8
On average, the bulk of heterogeneity owes neither to version effects nor to utility magnitude
effects. In other words, up to almost 90% of measured heterogeneity may reflect preference
heterogeneity among respondents. Version has a very small effect on utilities—statistically
significant, perhaps, but small. Version effects account for just under 2% of utility heterogeneity
on average and never more than 5% in any study.
Heterogeneity owing to utility magnitudes explains more—about 11% on average and as
high as 27.8% in one of the studies (the subject of that study should have been highly engaging
to respondents—they were all in the market for a newsy technology product; the same level of
engagement should also have been present in study 7, however, based on a very high response
rate in that study and the importance respondents would have placed on the topic). In other
words, in some studies a quarter or more of observed heterogeneity may owe to the magnitude
effects that reflect only differences in respondent consistency. Clearly magnitude effects merit
attention and we should consider removing them when appropriate (e.g., for reporting of utilities
but not in simulators) through scaling like Sawtooth Software’s zero-centered diffs.
IS THERE A MECHANICAL SOURCE OF THE VERSION EFFECT?
It could be that the version effect is mechanical: something about the separate versions themselves
causes the effect. To test this we looked at whether the version effect occurs among artificial
respondents with additive, main-effects utilities: if it does, then the effect is mechanical and not
psychological.
Rather than start from scratch and create artificial respondents who might have patterns of
preference heterogeneity unlike those of humans, we started with utility data from human
respondents. For a worst-case analysis, we used the study with the highest contribution from the
version effect, study 1 above. We massaged these utilities gently, standardizing each respondent’s
utilities so that all respondents had the same magnitude of utilities (the same standard deviation
across utilities) as the average of human respondents in study 1. In doing this we retain a realistic
human-generated pattern of heterogeneity, at the same time removing utility magnitude as an
explanation for any heterogeneity. Then we added logit choice rule-consistent independently,
identically distributed (i.i.d.) errors from a Gumbel distribution to the total utility of each
alternative in each choice set for each respondent. We had our artificial respondents choose the
highest-utility alternative in each choice set; the choice sets constituted the same versions the
human respondents received in Study 1. Finally, we ran HB-MNL to generate utilities.
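A minimal sketch of the choice-generation step for the artificial respondents (array layouts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_choices(utilities, design):
    """utilities: (n_resp, n_params) standardized part-worths; design:
    (n_resp, n_tasks, n_alts, n_params), each respondent coded with her own version.
    Adds i.i.d. Gumbel errors (consistent with the logit choice rule) to each
    alternative's total utility and returns the chosen alternative indices."""
    v = np.einsum("rtap,rp->rta", design, utilities)   # deterministic total utilities
    v = v + rng.gumbel(size=v.shape)                   # i.i.d. Gumbel error per alternative
    return v.argmax(axis=-1)                           # artificial respondent picks the max
```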
When we ran the decomposition described above on the utilities the version effects virtually
disappeared, with the variance explained by version effects falling from 4.9% of observed
heterogeneity for human respondents to 0.04% among the artificial respondents. Thus a
mechanical source does not explain the version effect.
DO CONTEXT EFFECTS EXPLAIN THE VERSION EFFECT?
Perhaps context effects could explain the version effect. Perhaps some respondents see some
particular levels or level combinations early in their survey and they answer the remainder of the
survey differently than do respondents who saw different initial questions.
At the conference the suggestion came up that we could investigate this by checking whether
the version effect differs in studies wherein choice tasks appear in random orders versus those in
which choice tasks occur in a fixed order. We went back to our nine studies and found that three
of them showed tasks in a random order while six had their tasks asked in a fixed order within
each version. It turned out that studies with randomized task orders had slightly smaller version
effects (explaining 0.7% of observed heterogeneity) than those with tasks asked in a constant
order in each version (explaining 1.9% of observed heterogeneity). So the order in which
respondents see choice sets may explain part of the small version effect we observe. Knowing
this we can randomize question order within versions to obscure, but not remove, the version
effect.
CONCLUSIONS/DISCUSSION
The effect of version on utilities is often significant but invariably small. It accounts for an
average under 2% of total observed heterogeneity in our nine empirical studies and in no case as
high as 5%. A much larger source of variance in utilities owes to magnitude differences, an
average of almost 11% in our studies and nearly 28% in one of them. The two effects together
account for about 12% of total observed heterogeneity in the nine data sets we investigated and
thus do not by themselves explain more than a fraction of the total heterogeneity we observe.
We would like to say that this means that the unexplained 88% of variance in utilities
constitutes true preference heterogeneity among respondents but we conclude more cautiously
that true preference heterogeneity or as yet unidentified sources of variance explain the
remaining 88%. Some of our simulation work points to another possible culprit: differences in
respondent consistency (reflected in the logit scale factor) may have both a fixed and a random
component to their effect on respondent-level utilities. Our analysis, using utility magnitude as a
proxy, attempts to pull out the fixed component of the effect respondent consistency has on
utilities but it does not capture the random effect. This turns out to be a large enough topic in its
own right to take us beyond the scope of this paper. We think it best left as an investigation for
another day.
Keith Chrzan
Aaron Hill
REFERENCES
Ben-Akiva, M. and S.R. Lerman (1985) Discrete Choice Analysis: Theory and Application to
Travel Demand. Cambridge: MIT.
FUSING RESEARCH DATA WITH SOCIAL MEDIA MONITORING TO
CREATE VALUE
KARLAN WITT
DEB PLOSKONKA
CAMBIA INFORMATION GROUP
OVERVIEW
Many organizations collect data across multiple stakeholders (e.g., customers, shareholders,
investors) and sources (e.g., media, research, investors), intending to use this information to
become more nimble and efficient, able to keep pace with rapidly changing markets while
proactively managing business risks. In reality, companies often find themselves drowning in
data, struggling to uncover relevant insights or common themes that can inform their business
and media strategy. The proliferation of social media, and the accompanying volume of news and
information, has added a layer of complexity to the already overwhelming repository of data that
organizations are mining. So the data becomes “big” long before the promised leverage and
insights of “Big Data” are realized. Attempts to analyze data streams individually often fail to uncover the
most relevant insights, as the resulting information typically remains in its original silo rather
than informing the broader organization. When consuming social media, companies are best
served by fusing social media data with data from other sources to uncover cohesive themes and
outcomes that can be used to drive value across the organization.
USING DATA TO UNLOCK VALUE IN AN ORGANIZATION
Companies typically move through three distinct stages of data integration as they try to use
data to unlock value in their organization:
Stages of Data Integration
Stage 1: Instill Appetite for Data
This early stage often begins with executives saying, “We need to track everything important
to our business.” In Stage 1, companies develop an appetite for data, but face a number of
challenges putting the data to work. The richness and immediacy of social media feedback
provides opportunity for organizations to quickly identify risks to brand health, opportunities for
customer engagement, and other sources of value creation, making the incorporation of social
media data an important component of business intelligence. However, these explosively rich
digital channels often leave companies drowning in data.
With over 300 million Facebook users, 200 million bloggers, and 25 million Twitter users,
companies can no longer control the flood of information circulating about their firms.
Translating data into metrics can provide some insights, but organizations often struggle with:
• Interacting with the tools set up for accessing these new data streams;
• Gaining a clear understanding of what the metrics mean, what a strong or weak
performance on each metric would look like, and what impact they might represent for
the business; and
• Managing the volume of incoming information to identify what should be escalated for
potential action.
To address these challenges, data streams are often driven into silos of expertise within an
organization. In Stage 1 firms, the silos, such as Social Media, Traditional Media, Paid Media
Data, and Research, seldom work closely together to bring all the organization’s data to full Big
Data potential. Instead, the periodic metrics published by a silo such as social media (number of
blogs, tweets, news mentions, etc.) lack richness and recommended actions, leading
organizations to realize they don’t need data; they need information.
Stage 2: Translate Data into Information
In Stage 2, companies often engage external expertise, particularly in social media, in an
effort to understand how best to digest and prioritize information. In outsourcing the data
consumption and analysis, organizations may see reports with norms and additional contextual
data, but still no “answers.” In addition, though an organization may have a more experienced
person monitoring and condensing information, the data has effectively become even more
siloed, with research and media expertise typically having limited interaction. Most organizations
remain in this stage of data use, utilizing dashboards and publishing reports across the
organization, but never truly understanding how to intelligently consume the influx of data to
drive value creation.
Stage 3: Create Value
Organizations that move to Stage 3 fuse social media metrics with other research data
streams to fully inform their media strategy and other actions they might take. The following
approach outlines four steps a company can take to help turn their data into actionable
information.
1. Identify Key Drivers of Desired Business Outcomes Using Market Research: In
today’s globally competitive environment, company perception is impacted by a broad
variety of stakeholders, each with a different focus on what matters. In this first step,
organizations can identify the Key Performance Indicators (attributes, value propositions,
core brand values, etc.) that are most important to each stakeholder audience. From this,
derived importance scores are calculated to quantify the importance of each KPI.
As seen below, research indicates different topics are more or less important to different
stakeholders. This becomes important as we merge this data into the social media
monitoring effort. As spikes occur in social media for various topics, a differential
response by stakeholder can occur, informed by the derived importance from research.
Example: Topic Importance by Stakeholder
2. Identify Thresholds: Once the importance scores for KPIs are established,
organizations must identify thresholds for each audience segment.
Alternative Approaches for Setting “Alert Thresholds”
1. Set a predefined threshold, e.g., 10,000 mentions warrants attention.
2. Compare to past data. Set a multiplier of the absolute number of mentions
over a given time period, e.g., 2x, 3x; alternatively, use the Poisson test to
identify a statistical departure from the expected.
3. Compare to past data. If the distribution is normal, set alerts for a certain
number of standard deviations from the mean; if not, use the 90th or 95th
percentile of historical data for similar time periods.
4. Model historical data with time series analyses to account for trends
and/or seasonality in the data.
From this step, organizations should also determine the sensitivity of alerts based on
individual client preferences.
“High” Sensitivity Alerts are for any time there is a spike in volume that may or may not
have significant business impacts.
“Low” Sensitivity Alerts are only for extreme situations which will likely have business
implications.
3. Media Monitoring: Spikes in media coverage are often event-driven, and by coding
incoming media by topic or theme, clients can be cued to which topics may be spiking at
any given moment. In the example below, digital media streams from Forums, Facebook,
YouTube, News, Twitter, and Blogs are monitored, with a specific focus on two key
topics of interest.
Example: Media Monitoring by Topic
Similarly, information can be monitored by topic on a dashboard, with topic volume
spikes triggering alerts delivered according to their importance, as in the example below:
Although media monitoring in this instance is set up by topic, fusing research data with
social media data allows the importance to each stakeholder group to be identified, as in
the example below:
4. Model Impact on KPIs: Following an event, organizations can model the impact on the
Key Performance Indicators by audience. Analyzing pre- and post-event measures across
KPIs will help to determine the magnitude of any impact, and will also help uncover if a
specific sub-group within an audience was impacted more severely than others.
Identifying the key attributes most impacted within the most affected sub-group would
suggest a course of action that enables an organization to prioritize its resources.
CASE STUDY
A 2012 study conducted by Cambia Information Group investigated the potential impact of
social media across an array of KPIs among several stakeholders of interest for a major retailer.
Cambia had been conducting primary research for this client for a number of years among these
audiences and was able to supplement this research with media performance data across the
social media spectrum.
Step 1: Identify Key Drivers of Desired Business Outcomes Using Market Research
This first step looks only at the research data. The research data used is from an on-going
tracking study that incorporates both perceptual and experiential-type attributes about this
company and top competitors. From the research data, we find that different topics are more or
less important to different stakeholders. The data shown in this case study is real, although the
company and attribute labels have been abstracted to ensure the learnings the company gained
through this analysis remain a competitive advantage.
The chart below shows the beta values across the top KPIs across all stakeholder groups. If
you are designing the research as a benchmark specifically to inform this type of intelligence
system, the attributes should be those things that can impact a firm’s success. For example, if you
are designing attributes for a conjoint study and it is about cars, and all competitors in the study
have equivalent safety records, safety may not make the list of attributes to be included. Choice
of color or 2- vs. 4-doors might be included.
However, when you examine the automotive industry over time, with its various recalls and other events that have caused significant brand damage among car manufacturers, safety would be a key element. We recommend that safety be included as a variable in a study informing a company about the media and which topics have the ability to impact the firm (especially negatively). Car color and 2- vs. 4-doors would not be included in this type of study.
Just looking at the red-to-green conditional formatting of the betas on the table below, it is
immediately clear that the importance values vary within and between key stakeholder groups.
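For readers who want to see the mechanics, here is a minimal sketch of one way such betas could be produced: a standardized regression of a KPI on the attribute ratings, run separately per stakeholder group. The data file and column names are placeholders, and the client's actual driver model may differ.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import StandardScaler

    df = pd.read_csv("survey.csv")                  # hypothetical tracking-study extract
    attributes = ["attr_1", "attr_2", "attr_3"]     # perceptual/experiential items
    betas = {}
    for group, sub in df.groupby("stakeholder_group"):
        X = StandardScaler().fit_transform(sub[attributes])
        y = StandardScaler().fit_transform(sub[["kpi"]]).ravel()
        betas[group] = pd.Series(LinearRegression().fit(X, y).coef_, index=attributes)

    # Rows = attributes, columns = stakeholder groups, analogous to the beta table below.
    print(pd.DataFrame(betas).round(2))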
Step 2: Identify Thresholds for Topics
The second step moves from working with the research data to working with the social media
data sources. The goal of this step is to develop a quantitative measure of social media activity
and set a threshold that, once reached, can trigger a notification to the client. This is a multi-step
process. For this data set, we used the following methodology:
1. Relevant topics are identified to parallel those chosen for the research. Social media
monitoring tools have the ability to listen to the various channels and tag the pieces that
fall under each topic.
2. Distributions of volume by day by topic were studied for their characteristics. While
some topics maintained a normal distribution, others did not. Given the lack of normality,
an approach involving mean and standard deviation was discarded.
3. Setting a pre-defined threshold was discarded as too difficult to support in a client
context. Additionally, the threshold would need to take into account the increasing
volume of social media over time.
4. Time series analyses would have been intriguing and are an extension to be considered down the road, although they require specialized software and a client who is comfortable with advanced modeling.
5. Distributions by day exhibit a “heartbeat” pattern—low on the weekends, higher on the weekdays. Thresholds need to account for this differential. Individuals clearly engage in social media behavior while at work—or, more generously, perhaps as part of their role at work.
6. For an approach that a client could readily explain to others, we settled on referencing the median of the non-normal distribution and flagging the 90th or 95th percentile for alerts. Given that some clients may wish to be notified more often, an 85th percentile cut is also offered. Cuts were identified for both weekdays and weekends, and, to account for the rise in social media volume, they reference no more than the past 6 months of data, as sketched below.
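A minimal sketch of the threshold logic just described, assuming a daily mention-volume series per topic: keep roughly the trailing six months, split weekdays from weekends, and flag the 85th/90th/95th percentiles as alert cuts. The function and variable names are ours, not part of any monitoring tool.

    import pandas as pd

    def alert_thresholds(daily_volume: pd.Series) -> pd.DataFrame:
        """daily_volume: topic mention counts indexed by calendar date."""
        cutoff = daily_volume.index.max() - pd.Timedelta(days=180)   # reference no more than ~6 months
        recent = daily_volume[daily_volume.index >= cutoff]
        is_weekend = recent.index.dayofweek >= 5                     # weekday/weekend "heartbeat"
        cuts = {
            "weekday": recent[~is_weekend].quantile([0.85, 0.90, 0.95]),
            "weekend": recent[is_weekend].quantile([0.85, 0.90, 0.95]),
        }
        # Columns 0.85 / 0.90 / 0.95: the most to least sensitive cuts; a day whose volume
        # exceeds the 0.95 cut would trigger the low-sensitivity (senior staff) alert.
        return pd.DataFrame(cuts).T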
So the thresholds were set up with the high (85th percentile) and low (95th percentile)
sensitivity levels. For our client, a manager-level team member received all high-sensitivity
notifications (high sensitivity as described earlier means it detects every small movement and
sends a notice). Senior staff received only low sensitivity notices. Since these were only the top
5% of all events, these were hypothesized to carry potential business implications.
Step 3: Media Monitoring
This step is where the research analytics and the social media analytics come together. The
attributes measured in the research and for which we have importance values by attribute by
stakeholder group are aligned with the topics that have been set up in the social media
monitoring tools. Because we were dealing with multiple topics across multiple stakeholder
groups, we chose to extend the online reporting for the survey data to provide a way to view the
status of all topics, in addition to the email alerts.
Looking at this by topic (topics are equivalent to attributes, like “safety”) shows the current status of each, as in the view below:
How these are displayed can be adapted in many different ways depending on the systems available to you.
Step 4: Model Impact on KPIs
Although media monitoring is set up by topic, the application of the research data allows the
importance to each stakeholder group to be identified. As an example, a spike in social media
about low employee wages or other fair labor practice violations might have a negative impact
on the employee audience, a moderate impact on customers, and no impact on the other
stakeholder groups.
Using this data, our client was able to respond very rapidly to an unexpected event that popped up in the media. The company was immediately able to identify which audiences would potentially be impacted by the media coverage, and focus available resources on messaging to the right audiences.
CONCLUSION
This engagement enabled the firm to take three key steps:
1. Identify problem areas or issues, and directly engage key stakeholder groups, in this case
the Voting Public and their Investors;
2. Understand the window of opportunity (time lag) between negative coverage and its
impact on the organization’s brand health;
3. Predict the brand health impact from social media channels, and affect that impact
through messaging of their own.
Potential extensions for inclusion of alert or risk ratings include:
1. Provide “share of conversation” alerts,
2. Develop alert ratings within segments,
3. Incorporate potential media exposure to calculate a risk ratio for each stakeholder group for any particular published item,
4. Expand the model to include the impact of each source,
5. Integrate with a firm’s overall strategy.
Karlan Witt
BRAND IMAGERY MEASUREMENT:
ASSESSMENT OF CURRENT PRACTICE AND A NEW APPROACH1
PAUL RICHARD MCCULLOUGH
MACRO CONSULTING, INC.
EXECUTIVE SUMMARY
Brand imagery research is an important and common component of market research
programs. Traditional approaches, e.g., ratings scales, have serious limitations and may even
sometimes be misleading.
MaxDiff scaling adequately addresses the major problems associated with traditional scaling
methods, but historically has had, within the context of brand imagery measurement, at least two
serious limitations of its own. Until recently, MaxDiff scores were comparable only to items
within the MaxDiff exercise. Traditional MaxDiff scores are relative, not absolute. Dual
Response (anchored) MaxDiff has substantially reduced this first problem but may have done so
at the price of reintroducing scale usage bias. The second problem remains: MaxDiff exercises
that span a reasonable number of brands and brand imagery statements often take too long to
complete.
The purpose of this paper is to review the practice and limitations of traditional brand
measurement techniques and to suggest a novel application of Dual Response MaxDiff that
provides a superior brand imagery measurement methodology that increases inter-item
discrimination and predictive validity and eliminates both brand halo and scale usage bias.
INTRODUCTION
Brand imagery research is an important and common component of most market research
programs. Understanding the strengths and weaknesses of a brand, as well as its competitors, is
fundamental to any marketing strategy. Ideally, any brand imagery analysis would not only
include a brand profile, providing an accurate comparison across brands, attributes and
respondents, but also an understanding of brand drivers or hot buttons.
Any brand imagery measurement methodology should, at a minimum, provide the following:
- Discrimination between attributes, for a given brand (inter-attribute comparisons)
- Discrimination between respondents or segments, for a given brand and attribute (inter-respondent comparisons)
- A good-fitting choice or purchase interest model to identify brand drivers (predictive validity)
With traditional approaches to brand imagery measurement, there are typically three interdependent issues to address:
- Minimal variance across items, i.e., flat responses
- Brand halo
- Scale usage bias

1 The author wishes to thank Survey Sampling International for generously donating a portion of the sample used in this paper.
Resulting data are typically non-discriminating, highly correlated and potentially misleading.
With high collinearity, regression coefficients may actually have reversed signs, leading to
absurd conclusions, e.g., lower quality increases purchase interest.
While scale usage bias may theoretically be removed via modeling, there is reason to suspect
any analytic attempt to remove brand halo since brand halo and real brand perceptions are
typically confounded. That is, it is difficult to know whether a respondent’s high rating of Brand
A on perceived quality, for example, is due to brand halo, scale usage bias or actual perception.
Thus, the ideal brand imagery measurement technique will exclude brand halo at the data
collection stage rather than attempt to correct for it at the analytic stage. Similarly, the ideal
brand imagery measurement technique will eliminate scale usage bias at the data collection stage
as well.
While the problems with traditional measurement techniques are well known, they continue
to be widely used in practice. Familiarity and simplicity are, no doubt, appealing benefits of
these techniques. Among the various methods used historically, the literature suggests that
comparative scales may be slightly superior. An example of a comparative scale is below:
Some alternative techniques have also garnered attention: MaxDiff scaling, method of paired
comparisons (MPC) and Q-sort. With the exception of Dual Response MaxDiff (DR MD), these
techniques all involve relative measures rather than absolute.
MaxDiff scaling, MPC and Q-sort all are scale-free (no scale usage bias), potentially have no
brand halo2 and demonstrate more discriminating power than more traditional measuring
techniques.
MPC is a special case of MaxDiff; as it has been shown to be slightly less effective it will not
be further discussed separately.
With MaxDiff scaling, the respondent is shown a random subset of items and asked to pick
which he/she most agrees with and which he/she least agrees with. The respondent is then shown
several more subsets of items. A typical MaxDiff question is shown below:
2 These techniques do not contain brand halo effects if and only if the brand imagery measures are collected for each brand separately rather than pooled.
Traditional MaxDiff3
With Q-sorting, the respondent is asked to place into a series of “buckets” a set of items, or
brand image attributes, from best describes the brand to least describes the brand. The number of
items in each bucket roughly approximates a normal distribution. Thus, for 25 items, the number
of items per bucket might be:
First bucket: 1 item
Second bucket: 2 items
Third bucket: 5 items
Fourth bucket: 9 items
Fifth bucket: 5 items
Sixth bucket: 2 items
Seventh bucket: 1 item
MaxDiff and Q-sorting adequately address two of the major issues surrounding monadic
scales, inter-attribute comparisons and predictive validity, but due to their relative structure do
not allow inter-brand comparisons. That is, MaxDiff and Q-sorting will determine which brand
imagery statements have higher or lower scores than other brand imagery statements for a given
brand but can’t determine which brand has a higher score than any other brand on any given
statement. Some would argue that MaxDiff scaling also does not allow inter-respondent
comparisons due to the scale factor. Additionally, as a practical matter, both techniques currently
accommodate fewer brands and/or attributes than traditional techniques.
Both MaxDiff scaling and Q-sorting take much longer to field than other data collection
techniques and are not comparable across studies with different brand and/or attribute sets. Q-sorting takes less time to complete than MaxDiff and is somewhat less discriminating.
As mentioned earlier, MaxDiff can be made comparable across studies by incorporating the
Dual Response version of MaxDiff, which allows the estimation of an absolute reference point.
This reference point may come at a price. The inclusion of an anchor point in MaxDiff exercises
may reintroduce scale usage bias into the data set.
However, for Q-sorting, there is currently no known approach to establish an absolute
reference point. For that reason, Q-sorting, for the purposes of this paper, is eliminated as a
potential solution to the brand measurement problem.
Also, for both MaxDiff and Q-sorting the issue of data collection would need to be
addressed. As noted earlier, to remove brand halo from either a MaxDiff-based or Q-sort-based
3 The form of MaxDiff scaling used in brand imagery measurement is referred to as Brand-Anchored MaxDiff (BA MD).
brand measurement exercise, it will be necessary to collect brand imagery data on each brand
separately, referred to here as Brand-Anchored MaxDiff. If the brands are pooled in the exercise,
brand halo would remain. Thus, there is the very real challenge of designing the survey in such a
way as to collect an adequate amount of information to accurately assess brand imagery at the
disaggregate level without overburdening the respondent.
Although one could estimate an aggregate level choice model to estimate brand ratings, that
approach is not considered viable here because disaggregate brand ratings data are the current
standard. Aggregate estimates would yield neither familiar nor practical data. Specifically,
without disaggregate data, common cross tabs of brand ratings would be impossible as would the
more advanced predictive model-based analyses.
A NEW APPROACH
Brand-Anchored MaxDiff, with the exception of being too lengthy to be practical, appears to
solve, or at least substantially mitigate, most of the major issues with traditional methods of
brand imagery measurement. The approach outlined below attempts to minimize the survey
length of Brand-Anchored MaxDiff by increasing the efficiency of two separate components of
the research process:
- Survey instrument design
- Utility estimation
Survey Instrument
A new MaxDiff question format, referred to here as Modified Brand-Anchored MaxDiff,
accommodates more brands and attributes than the standard design. The format of the Modified
Brand-Anchored MaxDiff used in Image MD is illustrated below:
To accommodate the Dual Response form of MaxDiff, a Direct Binary Response question is
asked prior to the MBA MD task set4:
To address the potential scale usage bias of MaxDiff exercises with Direct Binary Response,
a negative Direct Binary Response question, e.g., For each brand listed below, please check all
the attributes that you feel strongly do not describe the brand, is also included.5 As an additional
attempt to mitigate scale usage bias, the negative Direct Binary Response was asked in a slightly
different way for half the sample. Half the sample were asked the negative Direct Binary
Response question as above. The other half were asked a similar question except that
respondents were required to check as many negative items as they had checked positive. The
first approach is referred to here as unconstrained negative Direct Binary Response and the
second is referred to as constrained negative Direct Binary Response.
In summary, Image MD consists of an innovative MaxDiff exercise and two direct binary
response questions, as shown below:
4 This approach to Anchored MaxDiff was demonstrated to be faster to execute than the traditional Dual Response format (Lattery 2010).
5 Johnson and Fuller (2012) note that Direct Binary Response yields a different threshold than traditional Dual Response. By collecting both positive and negative Direct Binary Response data, we will explore ways to mitigate this effect.
It is possible, in an online survey, to further increase data collection efficiency with the use of
some imaginative programming. We have developed an animated way to display Image MD
tasks which can be viewed at www.macroinc.com (Research Techniques tab, MaxDiff Item
Scaling).
Thus, the final form of the Image MD brand measurement technique can be described as
Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative Direct
Binary Response.
Utility Estimation
Further, an exploration was conducted to reduce the number of tasks seen by any one
respondent and still retain sufficiently accurate disaggregate brand measurement data. MaxDiff
utilities were estimated using a Latent Class Choice Model (LCCM) and using a Hierarchical
Bayes model (HB). By pooling data across similarly behaving respondents (in the LCCM), we
hoped to substantially reduce the number of MaxDiff tasks per respondent. This approach may
be further enhanced by the careful use of covariates. Another approach that may require fewer MaxDiff tasks per person is to incorporate covariates in the upper model of an HB model or to run separate HB models for segments defined by some covariate.
To summarize, the proposed approach consists of:
- Animated Modified Brand-Anchored MaxDiff Exercise
- With Direct Binary Responses (both positive and negative)
- Analytic-derived parsimony:
  o Latent Class Choice Model:
    - Estimate disaggregate MaxDiff utilities
    - Use of covariates to enhance LCCM accuracy
  o Hierarchical Bayes:
    - HB with covariates in upper model
    - Separate HB runs for covariate-defined segments
    - Adjusted Priors6
RESEARCH OBJECTIVE
The objectives, then, of this paper are:
- To compare this new data collection approach, Animated Modified Brand-Anchored MaxDiff with Direct Binary Response, to a traditional approach using monadic rating scales
- To compare the positive Direct Binary Response and the combined positive and negative Direct Binary Response
- To confirm that Animated Modified Brand-Anchored MaxDiff with Direct Binary Response eliminates brand halo
- To explore ways to include an anchor point without reintroducing scale usage bias
- To explore utility estimation accuracy of LCCM and HB using a reduced set of MaxDiff tasks
- To explore the efficacy of various potential covariates in LCCM and HB
STUDY DESIGN
A two-cell design was employed: traditional brand ratings scales in one cell and the new MaxDiff approach in the other. Both cells were identical except for the method by which brand imagery data were collected:
- Traditional brand ratings scales
  o Three brands, each respondent seeing all three brands
  o 12 brand imagery statements
- Animated Modified Brand-Anchored MaxDiff with Direct Binary Response
  o Three brands, each respondent seeing all three brands
  o 12 brand imagery statements
  o Positive and negative Direct Binary Response questions
Cell sizes were:
- Monadic ratings cell: n = 436
- Modified MaxDiff: n = 2,605
  o Unconstrained negative DBR: n = 1,324
  o Constrained negative DBR: n = 1,281
The larger sample size for the second cell was intended so that attempts to reduce the
minimum number of choice tasks via LCCM and/or HB could be fully explored.
6 McCullough (2009) demonstrates that tuning HB model priors can improve hit rates in sparse data sets.
Both cells contained:
- Brand imagery measurement (ratings or MaxDiff)
- Brand affinity measures
- Demographics
- Holdout attribute rankings data
RESULTS
Brand Halo
We check for brand halo using confirmatory factor analysis, building a latent factor to
capture any brand halo effect. If the brand halo exists, the brand halo latent factor will positively
influence scores on all items. We observed a clear brand halo effect among the ratings scale data,
as expected. The unanchored MaxDiff data showed no evidence of the effect, also as expected.
The positive direct binary response reintroduced the brand halo effect to the MaxDiff data, at least as strongly as in the ratings scale data. This was not expected. However, the effect seems to be
totally eliminated with the inclusion of either the constrained or unconstrained negative direct
binary question.
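To make the check concrete, below is a minimal sketch of the single-factor confirmatory model described above, written with the semopy package (our tooling choice; the paper does not specify the software). The data file and item column names are placeholders.

    import pandas as pd
    import semopy

    # One row per respondent x brand; columns item1..item12 hold that brand's
    # (ratings or MaxDiff-derived) scores.
    data = pd.read_csv("brand_items.csv")

    # A single latent "Halo" loading on all twelve items; positive, significant
    # loadings on every item are the signature of a brand halo effect.
    desc = "Halo =~ " + " + ".join(f"item{i}" for i in range(1, 13))

    model = semopy.Model(desc)
    model.fit(data)
    print(model.inspect())   # loading estimates with standard errors and p-values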
Brand Halo Confirmatory Factor Analytic Structure
Brand Halo Latent loadings (Std Beta, with Prob in parentheses):

| Item | Ratings | No DBR | Positive DBR | Unconstrained Negative DBR | Constrained Negative DBR |
| Item 1 | 0.85 (***) | -0.14 (***) | 0.90 (***) | 0.44 (***) | 0.27 (***) |
| Item 2 | 0.84 (***) | -0.38 (***) | 0.78 (***) | -0.56 (***) | -0.72 (***) |
| Item 3 | 0.90 (***) | -0.20 (***) | 0.95 (***) | 0.42 (***) | 0.32 (***) |
| Item 4 | 0.86 (***) | 0.10 (***) | 0.90 (***) | 0.30 (***) | 0.16 (***) |
| Item 5 | 0.77 (***) | -0.68 (***) | 0.88 (***) | 0.03 (0.25) | 0.01 (0.78) |
| Item 6 | 0.85 (***) | -0.82 (***) | 0.87 (***) | -0.21 (***) | -0.24 (***) |
| Item 7 | 0.83 (***) | 0.69 (***) | 0.83 (***) | 0.42 (***) | 0.20 (***) |
| Item 8 | 0.82 (***) | 0.24 (***) | 0.75 (***) | 0.01 (0.87) | -0.23 (***) |
| Item 9 | 0.88 (***) | 0.58 (***) | 0.90 (***) | 0.77 (***) | 0.62 (***) |
| Item 10 | 0.87 (***) | 0.42 (***) | 0.94 (***) | 0.86 (***) | 0.90 (***) |
| Item 11 | 0.77 (***) | -0.05 (0.02) | 0.85 (***) | 0.07 (0.02) | -0.12 (***) |
| Item 12 | 0.88 (na) | 0.26 (na) | 0.91 (na) | 0.69 (na) | 0.53 (na) |
Scale Usage
As with our examination of brand halo, we use confirmatory factor analysis to check for the
presence of a scale usage factor. We build in latent factors to capture brand halo per brand, and
build another latent factor to capture a scale usage bias independent of brand. If a scale usage
bias exists, the scale latent factor should load positively on all items for all brands.
Scale Usage Bias and Brand Halo Confirmatory Factor Analytic Structure
We observe an obvious scale usage effect with the ratings data, where the scale usage latent
loads positively on all 36 items. Again, the MaxDiff with only positive direct binary response
shows some indication of scale usage bias, even with all three brand halo latents simultaneously
accounting for a great deal of collinearity. Traditional MaxDiff, and the two versions including
positive and negative direct binary responses all show no evidence of a scale usage effect.
Scale Usage Latent:

| | Ratings | No DBR | Positive DBR | Unconstrained Negative DBR | Constrained Negative DBR |
| Number of Negative Loadings | 0 | 14 | 5 | 10 | 15 |
| Number of Statistically Significant Loadings | 36 | 31 | 29 | 33 | 30 |
Predictive Validity
In the study design we included a holdout task which asked respondents to rank their top
three item choices per brand, giving us a way to test the accuracy of the various ratings/utilities
we collected. In the case of all MaxDiff data we compared the top three scoring items to the top
three ranked holdout items per person, and computed the hit rate. This approach could not be
directly applied to scale ratings data due to the frequency of flat responses (e.g., it is impossible
to identify a top three if all items were rated the same). For the ratings data we estimated the hit rate using this approach: if the highest ranked holdout item received the highest rating score, and that score was shared by n items, we added 1/n to the hit rate. Similarly, the second and third highest ranked holdout items received an adjusted hit point if those items were among the top 3 rated items.
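The fractional-credit rule for the ratings cell can be written out directly. Below is a small sketch of that calculation for one holdout item and one respondent; the tie-sharing for the second and third ranked items is our reading of "adjusted hit point," and the function name is ours.

    def adjusted_hit(ratings: dict, holdout_item: str, top_k: int = 1) -> float:
        """Fractional hit credit for one holdout item against one respondent's ratings.
        ratings: item -> rating score; top_k: how deep in the rated list the item may sit.
        Ties at the cut-off share credit, e.g. 1/n when n items share the highest score."""
        ranked_scores = sorted(ratings.values(), reverse=True)
        cutoff = ranked_scores[top_k - 1]                       # score needed to be "in the top k"
        tied_or_better = [i for i, r in ratings.items() if r >= cutoff]
        if holdout_item not in tied_or_better:
            return 0.0
        return float(top_k) / len(tied_or_better)               # shared credit when ties widen the set

    # Top-ranked holdout item, three items tied at the highest rating -> credit of 1/3.
    print(adjusted_hit({"a": 5, "b": 5, "c": 5, "d": 2}, holdout_item="a"))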
We observe that each of the MaxDiff data sets vastly outperformed ratings scale data, which
performed roughly the same as randomly guessing the top three ranked items.
Hit Rates:

| | Random Numbers | Ratings | No DBR | Positive DBR | Unconstrained Negative DBR | Constrained Negative DBR |
| 1 of 1 | 8% | 14% | 27% | 28% | 27% | 26% |
| (1 or 2) of 2 | 32% | 30% | 62% | 64% | 62% | 65% |
| (1, 2 or 3) of 3 | 61% | 51% | 86% | 87% | 86% | 88% |
Inter-item discrimination
Glancing visually at the resulting item scores, we can see that each of the MaxDiff versions shows greater inter-item discrimination and, among those, both negative direct binary versions bring the lower-performing brand closer to the other two brands.
Ratings Scales
MaxDiff with Positive DBR
MaxDiff with Positive DBR & Constrained Negative DBR
MaxDiff with Positive DBR & Unconstrained Negative DBR
To confirm, we considered how many statistically significant differences between statements could be observed within each brand per data collection method. The ratings scale data yielded the fewest statistically significant differences across items, while the MaxDiff with positive and unconstrained negative direct binary responses yielded the most. Traditional MaxDiff and MaxDiff with positive and constrained negative direct binary responses also performed very well, while the MaxDiff with only positive direct binary response performed much better than ratings scale data, but clearly not as well as the remaining three MaxDiff methods.
Average number of statistically significant differences across 12 items:

| | Ratings | No DBR | Positive DBR | Unconstrained Negative DBR | Constrained Negative DBR |
| Brand#1 | 1.75 | 4.46 | 3.9 | 4.3 | 4.68 |
| New Brand | 0 | 4.28 | 3.16 | 4.25 | 4.5 |
| Brand#2 | 1 | 4.69 | 3.78 | 4.48 | 4.7 |
Completion Metrics
A more sophisticated data collection method comes with a few costs in respondent burden. It took respondents much longer to complete any of the MaxDiff exercises than it took them to complete the simple ratings scales. The dropout rate during the brand imagery section of the survey (measured as the percentage of respondents who began that section but failed to finish it) was also much higher among the MaxDiff versions. On the plus side for the MaxDiff versions, when preparing the data for analysis we were forced to drop far fewer respondents due to flat-lining.
| | Ratings | All MaxDiff Versions |
| Brand Image Measurement Time (Minutes) | 1.7 | 6 |
| Incompletion Rate | 9% | 31% |
| Post-field Drop Rate | 32% | 4% |
Exploration to Reduce Number of Tasks Necessary
We find these results to be generally encouraging, but would like to explore if anything can
be done to reduce the increased respondent burden and dropout rates. Can we reduce the number
of tasks each respondent is shown, without compromising the predictive validity of the estimated
utilities? To find out, we estimated disaggregate utilities using two different estimation methods
(Latent Class and Hierarchical Bayes), varying the numbers of tasks, and using certain additional
tools to bolster the quality of the data (using covariates, or adjusting priors, etc.).
We continued only with the two MaxDiff methods with both positive and negative direct
binary responses, as those two methods proved best in our analysis. All estimation routines were
run for both the unconstrained and constrained versions, allowing us to further compare these
two methods.
Our chosen covariates included home ownership (rent vs. own), gender, purchase likelihood
for the brand we were researching, and a few others. Including these covariates when estimating
utilities in HB should yield better individual results by allowing the software to make more educated estimates based on each respondent’s most similar peers.
With covariates in place, utilities were estimated using data from 8 (full sample), 4, and 2
MaxDiff tasks, and hit rates were computed for each run. We were surprised to discover that
using only 2 tasks yielded only slightly less accuracy than using all 8 tasks. And in all cases, hit
rates seem to be mostly maintained despite decreased data.
Using Latent Class, the utilities were estimated again using these same six data subsets. As with HB, reducing the number of tasks used to estimate the utilities had minimal effect on the hit rates. It is worth noting here that, when using all 8 MaxDiff tasks, Latent Class noticeably underperforms Hierarchical Bayes, but this disparity decreases as tasks are dropped.
Various Task Hit Rates:

| | | Unconstrained Negative DBR | | | Constrained Negative DBR | | |
| | | 8 Tasks | 4 Tasks | 2 Tasks | 8 Tasks | 4 Tasks | 2 Tasks |
| HB | 1 of 1 | 27% | 21% | 20% | 26% | 24% | 22% |
| HB | (1 or 2) of 2 | 62% | 59% | 58% | 65% | 61% | 59% |
| HB | (1, 2 or 3) of 3 | 86% | 85% | 82% | 88% | 86% | 85% |
| LC | 1 of 1 | 19% | 20% | 19% | 21% | 21% | 22% |
| LC | (1 or 2) of 2 | 54% | 57% | 56% | 61% | 59% | 56% |
| LC | (1, 2 or 3) of 3 | 81% | 82% | 83% | 84% | 84% | 82% |
In estimating utilities in Hierarchical Bayes, it is possible to adjust the Prior degrees of
freedom and the Prior variance. Generally speaking, adjusting these values allows the researcher
to change the emphasis placed on the upper level model. In dealing with sparse data sets,
adjusting these values may lead to more robust individual utility estimates.
Utilities were estimated with data from 4 tasks, and with Prior degrees of freedom from 2 to
1000 (default is 5), and Prior variance from 0.5 to 10 (default is 2). Hit rates were examined at
various points on these ranges, and compared to the default settings. After considering dozens of
non-default configurations we observed essentially zero change in hit rates.
At this point it seemed that there was nothing that could diminish the quality of these
utilities, which was a suspicious finding. In searching for a possible explanation, we
hypothesized that these data simply have very little heterogeneity. The category of product being
researched is not emotionally engaging (light bulbs), and the brands being studied are not very
differentiated. To test this hypothesis, an additional utility estimation was performed, using only
data from 2 tasks, and with a drastically reduced sample size of 105. Hit rates were computed for the low-sample run both at the disaggregate level, that is, using unique individual utilities, and then again with each respondent’s utilities set equal to the average of the sample (constant utilities).
Unconstrained Negative DBR:

| | Random Choices | HB, 8 Tasks, N=1,324 | HB, 2 Tasks, N=105 | HB, 2 Tasks, N=105, Constant Utils |
| 1 of 1 | 8% | 27% | 22% | 25% |
| (1 or 2) of 2 | 32% | 62% | 59% | 61% |
| (1, 2 or 3) of 3 | 61% | 86% | 82% | 82% |
These results seem to suggest that there is very little heterogeneity for our models to capture in these particular data, which explains why even low-task utility estimates yield fairly high hit rates. Unfortunately, this means that we cannot say whether we can reduce the survey length of this new approach by reducing the number of tasks needed for estimation.
Summary of Results
| | Ratings | No DBR | Positive DBR | Unconstrained Negative DBR | Constrained Negative DBR |
| Provides Absolute Reference Point | No | No | Yes | Yes | Yes |
| Brand Halo | Yes | No | Yes | No | No |
| Scale Usage Bias | Yes | No | Yes | No | No |
| Inter-Item Discrimination | Very Low | High | Fairly High | High | High |
| Predictive Validity | Very Low | High | High | High | High |
| Complete Time | Fast | Slow | Slow | Slow | Slow |
| Dropout Rate | Low | High | High | High | High |
| Post-Field Drop Rate | High | Low | Low | Low | Low |
CONCLUSIONS
The form of MaxDiff referred to here as Animated Modified Brand-Anchored MaxDiff
Scaling with both Positive and Negative Direct Binary Response is superior to rating scales for
measuring brand imagery:
-Better inter-item discrimination
-Better predictive validity
-Elimination of brand halo
-Elimination of scale usage bias
-Fewer invalid completes
Using positive DBR alone to estimate MaxDiff utilities reintroduces brand halo and possibly
scale usage bias. Positive DBR combined with some form of negative DBR to estimate MaxDiff
utilities eliminates both brand halo and scale usage bias. Utilities estimated with Positive DBR
have slightly weaker inter-item discrimination than utilities estimated with Negative DBR.
The implication of these findings regarding DBR is that perhaps MaxDiff, if anchored, should always incorporate both positive and negative DBR, since positive DBR alone produces highly correlated MaxDiff utilities with less inter-item discrimination.
Another, more direct implication, is that Brand-Anchored MaxDiff with both positive and
negative DBR is superior to Brand-Anchored MaxDiff with only positive DBR for measuring
brand imagery.
Animated Modified Brand-Anchored MaxDiff Scaling with both Positive and Negative
Direct Binary Response takes longer to administer and has higher incompletion rates, however,
and further work needs to be done to make the data collection and utility estimation procedures
more efficient.
Paul Richard
McCullough
REFERENCES
Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2007), “Making MaxDiff more informative:
statistical data fusion by way of latent variable modeling,” 2007 Sawtooth Software
Conference Proceedings, Santa Rosa, CA
Bacon, L., Lenk, P., Seryakova, K., and Veccia, E. (2008), “Comparing Apples to Oranges,”
Marketing Research Magazine, Spring 2008
Böckenholt, U. (2004), “Comparative judgements as an alternative to ratings: Identifying the scale origin,” Psychological Methods
Chrzan, Keith and Natalia Golovashkina (2006), “An Empirical Test of Six Stated Importance
Measures,” International Journal of Marketing Research
Chrzan, Keith and Doug Malcom (2007), “An Empirical Test of Alternative Brand Systems,”
2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA
Chrzan, Keith and Jeremy Griffiths (2005), “An Empirical Test of Brand-Anchored Maximum
Difference Scaling,” 2005 Design and Innovations Conference, Berlin
Cohen, Steven H. (2003), “Maximum Difference Scaling: Improved Measures of Importance and
Preference for Segmentation,” Sawtooth Software Research Paper Series
Dillon, William R., Thomas J. Madden, Amna Kirmani and Soumen Mukherjee (2001),
"Understanding What’s in a Brand Rating: A Model for Assessing Brand and Attribute
Effects and Their Relationship to Brand Equity," JMR
Hendrix, Phil, and Drucker, Stuart (2007), “Alternative Approaches to Maxdiff With Large Sets of Disparate Items-Augmented and Tailored Maxdiff,” 2007 Sawtooth Software Conference Proceedings, Santa Rosa, CA
Horne, Jack, Bob Rayner, Reg Baker and Silvo Lenart (2012), “Continued Investigation Into the
Role of the ‘Anchor’ in MaxDiff and Related Tradeoff Exercises,” 2012 Sawtooth Software
Conference Proceedings, Orlando, FL
Johnson, Paul, and Brent Fuller (2012), “Optimizing Pricing of Mobile Apps with Multiple
Thresholds in Anchored MaxDiff,” 2012 Sawtooth Software Conference Proceedings,
Orlando, FL
Lattery, Kevin (2010), “Anchoring Maximum Difference Scaling Against a Threshold-Dual
Response and Direct Binary Responses,” 2010 Sawtooth Software Conference Proceedings,
Newport Beach, CA
Louviere, J.J., Marley, A.A.J., Flynn, T., Pihlens, D. (2009), “Best-Worst Scaling: Theory,
Methods and Applications,” CenSoc: forthcoming.
Magidson, J., and Vermunt, J.K. (2007a), “Use of a random intercept in latent class regression models to remove response level effects in ratings data,” Bulletin of the International Statistical Institute, 56th Session, paper #1604, 1–4. ISI 2007: Lisboa, Portugal
Magidson, J., and Vermunt, J.K. (2007b), “Removing the scale factor confound in multinomial
logit choice models to obtain better estimates of preference,” 2007 Sawtooth Software
Conference Proceedings, Santa Rosa, CA
McCullough, Paul Richard (2009), “Comparing Hierarchical Bayes and Latent Class Choice:
Practical Issues for Sparse Data Sets,” 2009 Sawtooth Software Conference Proceedings,
Delray Beach, FL
Vermunt and Magidson (2008), LG Syntax User’s Guide: Manual for Latent GOLD Choice 4.5
Syntax Module
Wirth, Ralph, and Wolfrath, Annette (2012), “Using MaxDiff for Evaluating Very Large Sets of
Items,” 2012 Sawtooth Software Conference Proceedings, Orlando, FL
ACBC REVISITED
MARCO HOOGERBRUGGE
JEROEN HARDON
CHRISTOPHER FOTENOS
SKIM GROUP
ABSTRACT
Adaptive Choice-Based Conjoint (ACBC) was developed by Sawtooth Software in 2009 as
an alternative to their classic CBC software in order to obtain better respondent data in complex
choice situations. Similar to Adaptive Conjoint Analysis (ACA) many years ago, this alternative
adapts the design of the choice experiment to the specific preferences of each respondent.
Despite its strengths, ACBC has not garnered the popularity that ACA did as CBC has
maintained the dominant position in the discrete choice modeling market. There are several
possibilities concerning the way ACBC is assessed and its various features that may explain why
this has happened.
In this paper, we compare ACBC to several other methodologies and variations of ACBC
itself in order to assess its performance and look into potential ways of improving it. What we
show is that ACBC does indeed perform very well for modeling choice behavior in complex
markets and it is robust enough to allow for simplifications without harming results. We also
present a new discrete choice methodology called Dynamic CBC, which combines features of
ACBC and CBC to provide a strong alternative to CBC in situations where running an ACBC
study may not be applicable.
Though this paper will touch on some of the details of the standard ACBC procedure, for a
more in-depth overview and introduction to the methodology please refer to the ACBC Technical
Paper published in 2009 from the Sawtooth Software Technical Paper Series.
BACKGROUND
Our methodological hypotheses as to why ACBC has not yet caught up to the popularity of
CBC are related to the way in which the methodology has been critiqued as well as how
respondents take the exercise:
1. Our first point relates to the way in which comparison tests between ACBC and CBC
have been performed in the past. We believe that ACBC will primarily be successful in
markets for which the philosophy behind ACBC is truly applicable. This means that the
market should consist of dozens of products so that consumers need to simplify their choices
upfront by consciously or subconsciously creating an evoked set of products from which to
choose. This evoked set can be different for each consumer: some consumers may restrict
themselves to one or more specific brands while other consumers may only shop within
specific price tiers. This aligns with the non-compensatory decision making behavior that
Sawtooth Software (2009) was aiming to model when designing the ACBC platform. For
example, shopping for technology products (laptops, tablets, etc.), subscriptions (mobile,
insurance), or cars may be very well suited for modeling via ACBC.
If we are studying such a market, the holdout task(s), which are the basis for comparison
between methodologies, should also reflect the complexity of the market! Simply put, the
holdout tasks should be similar to the scenario that we wish to simulate. Whereas in a typical
ACBC or CBC choice exercise we may limit respondents to three to five concepts to ensure
that they assess all concepts, holdout tasks for complex markets can be more elaborate as
they are used for model assessment rather than deriving preference behavior.
2. For many respondents in past ACBC studies we have seen that they may not have a very
‘rich’ choice tournament because they reject too many concepts in the screening section of
the exercise. Some of them do not even get to the choice tournament or see just one choice
tournament task. In an attempt to curb this kind of behavior we have tried to encourage
respondents to allow more products through the screening section by moderately rephrasing
the question text and/or the text of the two answer options (would / would not consider). We
have seen that this helps a bit though not enough in expanding the choice tournament. The
simulation results are mostly based on the choices in the tournament, so fewer choice tasks
could potentially lead to a less accurate prediction. If ACBC may be adjusted such that the
choice tournament provides ‘richer’ data, the quality of the simulations (and for the holdout
predictions) may be improved.
3. Related to our first point on the realism of simulated scenarios, for many respondents we
often see that the choice tournament converges to a winning concept that has a price way
below what is available in the market (and below what is offered in the holdout task). With
the default settings of ACBC (-30% to +30% of the summed component price), we have seen
that approximately half of respondents end up with a winning concept that has a price 15–
30% below the market average. Because of this, we may not learn very much about what
these respondents would choose in a market with realistic prices. A way to avoid this is to
follow Sawtooth Software’s early recommendation to have an asymmetric price range (e.g.,
from 80% to 140% of market average) and possibly also to use a narrower price range if
doing so is more in line with market reality.
4. A completely different kind of hypothesis refers to the design algorithm of ACBC. In
ACBC, near-orthogonal designs are generated consisting of concepts that are “near-neighbors” to the respondent’s BYO task configuration while still including the full range of
levels across all attributes. The researcher has some input into this process in that they can
specify the total number of concepts to generate, the minimum and maximum number of
levels to vary from the BYO concept, and the percent deviation from the total summed price
as mentioned above (Sawtooth Software 2009). It may be the case that ACBC (at its default
settings) is too extreme in its design of concepts at the respondent level and the Hierarchical
Bayes estimation procedure may not be able to fully compensate for this in its borrowing
process across respondents. For this we propose two potential solutions:
a. A modest improvement for keeping the design as D-efficient as possible is to
maximize the number of attributes to be varied from the BYO task.
b. A solution for this problem may be found outside ACBC as well. An adaptive
method that starts with a standard CBC and preserves more of CBC’s D-efficiency throughout the exercise may lead to a better balance between D-efficiency and interactivity than is currently available in other methodologies.
THE MARKET
In order to test the hypotheses mentioned above, we needed to run our experiment in a
market that is complex enough to evoke the non-compensatory choice behavior of
respondents that is captured by the ACBC exercise. We therefore decided to use the television
market in the United States. As the technology that goes into televisions has become more
advanced (think smart TVs, 3D, etc.) and the number of brands available has expanded, the
choice of a television has grown more elaborate as has the pricing structure of the category.
There are so many features that play a role in the price of a television and the importance of
these features between respondents can vary greatly. Based on research of televisions widely
available to consumers in the US, we ran our study with the following attributes and levels:
| Attribute | Levels |
| Brand | Sony, Samsung, LG, Panasonic, Vizio |
| Screen Size | 22˝, 32˝, 42˝, 52˝, 57˝, 62˝, 67˝, 72˝ |
| Screen Type | LED, LCD, LED-LCD, Plasma |
| Resolution | 720p, 1080p |
| Wi-Fi Capability | Yes/No |
| 3D Capability | Yes/No |
| Number of HDMI Inputs | 0–3 connections |
| Total Price | Summed attribute price ranged from $120 to $3,500 |
In order to qualify for the study, respondents had to be between the ages of 22 and 65, live
independently (i.e., not supported by their parents), and be in the market for a new television
sometime within the next 2 years. Respondent recruitment came through uSamp’s online panel.
THE METHODS
In total we looked into eight different methodologies in order to assess the performance of
standard ACBC and explore ways of improving it as per our previously mentioned hypotheses.
These methods included variations of ACBC and CBC as well as our own SKIM-developed
alternative called Dynamic CBC, which combines key features of both methods. It is important
to note that the same attributes and levels were used across all methods and that the designs
contained no prohibitions within or between attributes. Additionally, all methods showed
respondents a total summed component price for each television, consistent with a standard
ACBC exercise, and was taken by a sample of approximately 300 respondents (2,422 in total).
The methods are as follows:
A. Standard ACBC—When referring to standard ACBC, we mean that respondents
completed all portions of the ACBC exercise (BYO, screening section, choice
tournament) and that the settings were generally in line with those recommended by
Sawtooth Software including having between two and four attributes varied from the
BYO exercise in the generation of concepts, six screening tasks with four concepts each,
unacceptable questions after the third and fourth screening task, one must-have question
after the fifth screening task, and a price range varying between 80–140% of the total
summed component price. Additionally, for this particular leg price was excluded as an
attribute in the unacceptable questions asked during the screening section.
B. ACBC with price included in the unacceptable questions—This leg had the same
settings as the standard ACBC leg however price was included as an attribute in the
unacceptable questions of the screening section. The idea behind this is to increase the
number of concepts that make it through the screening section in order to create a richer
choice tournament for respondents, corresponding with our hypothesis that a longer, more
varied choice tournament could potentially result in better data. If respondents confirm
that a very high price range is unacceptable to them, the concepts they evaluate no longer
contain these higher price points, so we could expect respondents to accept more
concepts in the further evaluation. This of course also creates the risk that the respondent
is too conservative in screening based on price and leads to a much shorter choice
tournament.
C. ACBC without a screening section—Again going back to the hypothesis of the
“richness” of the choice tournament for respondents, this leg followed the same settings
as our baseline ACBC exercise but skipped the entire screening section of the
methodology. By skipping the screening section, this ensured that respondents would see
a full set of choice tasks (in this case 12). The end result is still respondent-specific as
designs are built with the “near-neighbor” concept in mind but they are just prevented
from further customizing their consideration set for the choice tournament. We thereby
have collected more data from the choice tournament for all respondents, providing more
information for utility estimation. Additionally, skipping the screening section may lead
to increased respondent engagement by way of shortening the overall length of interview.
D. ACBC with a narrower tested price range—To test our hypothesis on the reality of the
winning concept price for respondents, we included one test leg that had the equivalent
settings of the standard ACBC leg with the exception of using a narrower tested price
deviation from the total summed component price. For this cell a range of 90% to 120%
of the total summed component price was used as opposed to 80% to 140% as used by
the other legs. Although this is not fully testing our hypothesis as we are not comparing to
the default 70% to 130% range, we feel that the 80% to 140% range is already an
accepted better alternative to the default range and we can learn more by seeing if an
even narrower range improves results.
E. ACBC with 4 attributes varied from the BYO concept—We tested our last hypothesis
concerning the design algorithm of ACBC by including a leg which forced there to be
four attributes varied (out of seven possible) from the BYO concept across all concepts
generated for each respondent whereas all other ACBC legs had between two and four
attributes varied as previously mentioned. The logic behind this is to ensure that the non-BYO levels show up more often in the design, pushing it closer to being level-balanced and therefore more statistically efficient.
F. Standard CBC—The main purpose of this leg was to serve as a comparison to standard
ACBC and the other methods.
G. Price balanced CBC—A CBC exercise that shows concepts of similar price on each
screen. This has been operationalized by means of between-concept prohibitions on price.
This is similar to the way ACBC concepts within a task are relatively utility balanced
since they are all built off variations of the respondent’s BYO concept. Although this is
not directly tied into one of our hypotheses concerning ACBC, it is helpful in understanding
if applying a moderate utility balance within the CBC methodology could help improve
results.
H. Dynamic CBC—As previously mentioned, this method is designed by SKIM and
includes features of both CBC and ACBC. Just like a standard CBC exercise, Dynamic
CBC starts out with an orthogonal design space for all respondents as there is no BYO or
screening exercise prior to the start of the choice tasks. However like ACBC, the method
is adaptive in the way in which it displays the later tasks of the exercise. At several points
throughout the course of the exercise, one of the attributes in the pre-drawn orthogonal
design has its levels replaced with a level from a concept previously chosen by the
respondent. Between these replacements, respondents were asked which particular
attribute (in this case television feature) they focused on the most when making their
selections in previous tasks. The selection of the attribute to replace was done randomly
though the attribute that the respondent stated they focused on the most when making
their selection was given a higher probability of being drawn in this process. This idea is
relatively similar to the “near-neighbors” concept of ACBC in the sense that we are fixing
an attribute to a level which we are confident that the respondent prefers (like their BYO
preferred level) and thus force them to make trade-offs between other attributes to gain
more insights into their preferences in those areas. In this particular study, this adaptive
procedure occurred at three points in the exercise.
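A minimal sketch of the adaptive replacement step as we read the description above: the attribute to fix is drawn at random, with extra weight on the attribute the respondent said they focused on. The weighting constant is illustrative and the actual SKIM implementation may differ.

    import random

    def pick_attribute_to_fix(attributes: list, stated_focus: str, focus_weight: float = 3.0) -> str:
        """Choose which attribute has its levels replaced with the level from a previously
        chosen concept; the stated-focus attribute gets a higher (assumed) draw probability."""
        weights = [focus_weight if a == stated_focus else 1.0 for a in attributes]
        return random.choices(attributes, weights=weights, k=1)[0]

    tv_attributes = ["Brand", "Screen Size", "Screen Type", "Resolution",
                     "Wi-Fi Capability", "3D Capability", "Number of HDMI Inputs"]
    # The respondent says screen size mattered most, so that attribute is the most
    # likely (but not certain) candidate to be fixed at its previously chosen level.
    print(pick_attribute_to_fix(tv_attributes, stated_focus="Screen Size"))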
THE COMPARISON
At this point we have not yet fully covered how we went about testing our first hypothesis
concerning the way in which comparison tests have been performed between ACBC and other
methodologies. Aside from applying all methodologies to a complex market, in this case the
television market in the United States, the holdout tasks had to be representative of the typical
choice that a consumer might have to make in reality. After all, a more relevant simulation from
such a study is one that tries to mimic reality as best as possible. In order to better represent the
market, respondents completed a set of three holdout tasks consisting of twenty concepts each
(not including a none option). Within each holdout task, concepts were relatively similar to one
another in terms of their total price so as not to create “easy” decisions for respondents that
always went for a particular price tier or feature. We aimed to cover a different tier of the market
with each task so as not to alienate any respondents because of their preferred price tier. An
example holdout task can be seen below:
While quite often holdout tasks will be similar to the tasks of the choice exercise, in the case
of complex markets it would be very difficult to have many screening or choice tournament tasks
that show as many as twenty concepts on the screen and still expect reliable data from
respondents. In terms of the reality of the exercise, it is important to distinguish between the
reality of choice behavior and the decision in the end. Whereas the entire ACBC exercise can
help us better understand the non-compensatory choice behavior of respondents while only
showing three to four concepts at a time, we still must assess its predictive power in the context
of an “actual” complex purchase that we aim to model in the simulator.
As a result of having so many concepts available in the holdout tasks, the probability of a
successful holdout prediction is much lower than what one may be used to seeing in similar
research. For example, if we were to use hit rates as a means of assessing each methodology, by
random chance we would be correct just 5% of the time. Holdout hit rates also provide no
information about the proximity of the prediction, therefore giving no credit for when a method
comes close to predicting the correct concept. Take the following respondent simulation for
example:
| Concept | Share of Preference | Concept | Share of Preference |
| 1 | 0.64% | 11 | 1.48% |
| 2 | 0.03% | 12 | 0.88% |
| 3 | 0.65% | 13 | 29.78% |
| 4 | 0.33% | 14 | 0.95% |
| 5 | 4.68% | 15 | 27.31% |
| 6 | 0.18% | 16 | 12.99% |
| 7 | 4.67% | 17 | 4.67% |
| 8 | 2.40% | 18 | 0.17% |
| 9 | 1.75% | 19 | 4.28% |
| 10 | 0.07% | 20 | 2.11% |
If, for example, the respondent in this particular scenario chose concept 15, which comes in a close second to concept 13, a hit rate would simply count the methodology as incorrect despite how close it came to being correct. By definition, the share of preference model is also telling us that there is roughly a 70.22% chance that the respondent would not choose concept 13, so when the respondent picks another concept the model is not necessarily wrong either.
Because of this potential for hit rates to understate the accuracy of a method, particularly one
with complex holdout tasks, we decided to use an alternative success metric: the means of the
share of preference of the concepts that the respondents chose in the three holdout tasks. In the
example above, if the respondent chose concept 15, their contribution to the mean share of
preference metric for the methodology that they took would be a score of 27.31%. As a
benchmark, this could be compared to the predicted share of preference of a random choice,
which in this case would be 5%. Please note that all mean share of preferences reported for the
remainder of this paper used a default exponent of one in the modeling process. While testing
values of the exponent, we found that any exponent value between 0.7 and 1 yielded similar
optimal results.
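In code, the metric reduces to a few lines. The sketch below assumes, for each respondent-task, a vector of simulated shares of preference over the twenty concepts and the index of the concept actually chosen; the names and toy data are ours.

    import numpy as np

    def mean_chosen_sop(predicted_shares: np.ndarray, chosen: np.ndarray) -> float:
        """predicted_shares: (n_respondent_tasks, n_concepts) simulated shares of preference,
        each row summing to 1; chosen: index of the concept picked in that holdout task.
        Returns the mean predicted share of the concepts respondents actually chose."""
        return float(predicted_shares[np.arange(len(chosen)), chosen].mean())

    # Toy example with 20 concepts: a random-choice benchmark scores about 1/20 = 5%.
    rng = np.random.default_rng(0)
    shares = rng.dirichlet(np.ones(20), size=1000)   # stand-in for simulator output
    picks = rng.integers(0, 20, size=1000)           # stand-in for holdout choices
    print(round(mean_chosen_sop(shares, picks), 3))  # close to 0.05 under random choice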
In addition, in the tables we have also added the “traditional” hit rate, comparing first choice
simulations with the actual answers to holdout tasks. A methodological note on the two measures
of performance has been added at the end of this paper.
RESPONDENTS ENJOYED BOTH CBC AND ACBC EXERCISES
As a first result, it is important to note that there were no significant differences (using an
alpha of 0.05) in respondent enjoyment or ease of understanding across the methodologies. At
the end of the survey respondents were asked whether they enjoyed taking the survey and
whether they found it difficult to fill in the questionnaire. Please note though that these questions
were asked in reference to the entire survey, not necessarily just the discrete choice portion of the
study. A summary of the results from these questions can be found below:
A key takeaway from this is that there is little need to worry about whether respondents will
have a difficult time taking a CBC study or enjoy it any less than an ACBC study despite being
less interactive. Though this is a purely directional result, it is interesting to note that the methodology rated least difficult to fill in was ACBC without a screening section, even less so than a standard CBC exercise that does not include a BYO task.
MAIN RESULTS
Diving into the more important results now, what we see is that all variations of ACBC
significantly outperformed all variations of CBC, though there is no significant difference (using
an alpha of 0.05) between the different ACBC legs. Interestingly though, Dynamic CBC was
able to outperform both standard and price balanced CBC.
A summary of the results can be found in the chart below. Notably, all of the above conclusions hold for both metrics.
Some comments on specific legs:
- ACBC with a price range of 90%-120% behaved as expected in the sense that the concepts in the choice tournament had more realistic price levels. However, this did not result in more accurate predictions.
- ACBC with price in the unacceptable questions behaved as expected in the sense that no less than 40% of respondents rejected concepts above a certain price point. However, this did not result in a significantly richer choice tournament and consequently did not lead to better predictions.
- ACBC without a screening section behaved as expected in the sense that there was a much richer choice tournament (by definition, because all near-neighbor concepts were in the tournament). This did not lead to an increase in prediction performance, however. But perhaps more interesting is the fact that the prediction performance did not decline either in comparison with standard ACBC. Apparently a rich tournament compensates for dropping the screening section, so it seems like the screening section can be skipped altogether, saving a substantial amount of interview time (except for the most disengaged respondents who formerly largely skipped the tournament).
INCREASING THE GRANULARITY OF THE PRICE ATTRIBUTE
As we mentioned earlier, each leg showed respondents a summed total price for each concept
and this price was allowed some variation according to the ranges that we specified in the setup.
This is done by taking the summed component price of the concept and multiplying it by a
random number in line with the specified variation range. Because of this process, price is a
continuous variable rather than discrete as with the other attributes in the study. Therefore in
order to estimate the price utilities, we chose to apply a piecewise utility estimation for the
attribute as is generally used in ACBC studies involving price. In the current ACBC software,
users are allowed to specify up to 12 cut-points in the tested price range for which price slopes
will be estimated between consecutive points. Using this method, the estimated slope is constant between each pair of consecutive cut-points, so it is important to choose the cut-points in such a way that they best reflect the varying consumer price sensitivities across the range (i.e., identifying price thresholds or regions of relative inelasticity).
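To make the mechanics concrete, here is a minimal sketch (not from the paper) of how a piecewise-linear price utility can be evaluated once part-worths have been estimated at the chosen cut-points; the cut-point locations and utility values below are purely hypothetical:

```python
import numpy as np

def piecewise_price_utility(price, cut_points, cut_utilities):
    """Interpolate a utility for `price` from utilities estimated at cut-points.

    cut_points    : increasing price points, e.g. [120, 600, 1500, 3500]
    cut_utilities : part-worth estimates at those cut-points (same length)
    The slope is constant between consecutive cut-points, which is the
    defining property of a piecewise-linear price function.
    """
    return float(np.interp(price, cut_points, cut_utilities))

# Hypothetical example with four cut-points over the tested price range
cuts = [120, 600, 1500, 3500]
utils = [1.2, 0.4, -0.6, -2.0]
print(piecewise_price_utility(950, cuts, utils))  # interpolated between $600 and $1,500
```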
Given the wide range of prices tested in our study (as mentioned earlier, the summed
component price ranged from $120 to $3,500) we felt that our price range could be better
modeled using more cut-points than the software currently allows. Since the concepts shown to a
respondent within ACBC are near-neighbors of their BYO concept, it is quite possible that
through this and the screening exercise we only collected data for these respondents on a subset
of the total price range. This would make many of the specified cut-points irrelevant to such a respondent and lead to a poorer reflection of their price sensitivity across the price range that
was most relevant to them. We saw this happening especially with respondents with a BYO
choice in the lowest price range.
Based on this and the relative count frequencies (percent of times chosen out of times shown)
in the price data, we decided to increase the number of cut-points to 28 and run the estimation
using CBC HB. As you can see in the data below, ACBC benefits much more from increasing the
number of cut-points than CBC (i.e., the predicted SoP in the ACBC legs increases, while the
predicted SoP in the CBC legs remains stable; the predicted traditional hit rate in the ACBC legs
remains nearly stable while for CBC it decreases). A reasonable explanation for this difference
between the ACBC and CBC legs is the fact that in ACBC legs the variation in prices in each
individual interview is generally a lot smaller than in a CBC study. So in ACBC studies using this many cut-points does no harm, and judging by the mean SoP metric it actually appears to help.
AN ALTERNATIVE MODEL COMPARISON
In order to make sure that our results were not purely the result of our mean share of
preference metric for comparing the methodologies, we also looked into the mean squared error
of the predicted share of preferences across respondents. Using this metric, we could see how
much the simulated preference shares deviated from the actual holdout task choices. This was
calculated as:
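Based on the description that follows, the metric presumably takes the form
\[ \text{MSE} \;=\; \frac{1}{N}\sum_{c=1}^{N}\Big(\text{aggregate concept share from holdout task}_c \;-\; \text{aggregate mean predicted concept SoP}_c\Big)^2 , \]
with the sum running over the N concepts in the holdout tasks.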
Where the aggregate concept share from the holdout task is the percent of respondents that
selected the concept in the holdout task and the aggregate mean predicted concept SoP is the
mean share of preference for the concept across all respondents.
As displayed in the table below, using mean squared error as a success metric also confirms
our previous results based on the mean share of preference metric:
When looking into these results, it was a bit concerning that the square root of the mean
squared error was close to the average aggregate share of preference for a concept (5% = 1/20
concepts), particularly for the CBC exercises. Upon further investigation, we noticed that there
was one particular concept in one of the holdout tasks that generates a large amount of the
squared error in the table above. This concept was the cheapest priced concept in the holdout task
that contained products of a higher price tier than the other two holdout tasks. At an aggregate
level, this particular concept had an actual holdout share of 43% but the predicted share of
preference was significantly lower: the ACBC modules had an average share of preference of
27% for this concept whereas the standard CBC leg (8%) and price-balanced CBC leg (17%)
performed much worse. Removing this one concept from the consideration set relieved the mean squared error metric quite a bit:
Although “only” one concept seems to primarily distort the results, it is nevertheless meaningful to dive further into it. After all, we do not want the simulator to produce an entirely wrong share for even a single product. In the section below we look at it in more depth. For readers who prefer to skip that section, the short version is that brand sensitivity seems to be overestimated by CBC-HB and ACBC-HB while price sensitivity seems to be underestimated. This applies to all CBC and ACBC legs
(although ACBC legs slightly less so). The one concept with a completely wrong MSE
contribution is exactly the cheapest of the 20 concepts in that task, hence its predicted share is
way lower than its stated actual share.
CLUSTER ANALYSIS ON PREDICTED SHARES OF PREFERENCE FOR HOLDOUT TASKS
A common way to cluster respondents based on conjoint data is by means of Latent Class
analysis. In this way we get an understanding of differences between respondents regarding all
attribute levels in the study. A somewhat different approach is to cluster respondents based on
predicted shares of preference in the simulator. In this way we get an understanding of
differences between respondents regarding their preferences for actual products in the market.
While Latent Class is based on the entire spectrum of attribute levels, clustering on predicted
shares narrows the clustering down to what is relevant in the current market.
We took the latter approach and applied CCEA for clustering on the predicted shares of
preference for the concepts in the three holdout tasks (as mentioned earlier, the holdout tasks
were meant to be representative of a simulated complex market). We combined all eight study
legs together to get a robust sample for this type of analysis. This was possible because the
structure of the data and the scale of the data (shares of preference) is the same across all eight
legs. In retrospect we checked if there were significant differences in cluster membership
between the legs (there were some significant differences but these differences were not
particularly big).
The cluster analysis resulted in 2 clusters, each with 5 sub-clusters, for a total of 10 sub-clusters:
1. One cluster with respondents preferring a low end TV (up to some $600), about 1/3 of the
sample
 This cluster consists of 5 sub-clusters based on brand—the 5 sub-clusters largely
coincide with a strong preference for just one specific brand out of the 5 brands
tested, although in some sub-clusters there was a preference for two brands jointly.
2. The other cluster with respondents preferring a mid/high end TV (starting at some $600),
about 2/3 of the sample
 This cluster also consists of 5 sub-clusters based on brand, with the same remark as
before for the low end TV cluster.
The next step was to redo the MSE comparison by cluster and sub-cluster. Please note that
the shares of preference within a cluster by definition will reflect the low end or mid/high end
category and a certain brand preference, completely in line with the general description of the
cluster. After all, the clustering was based on these shares of preference.
The interesting thing is to see how the actual shares (based on actual answers in holdout
tasks) behave by sub-cluster. The following is merely an example of one sub-cluster, namely for
the “mid/high end Vizio cluster,” yet the same phenomenon applies in all of the other 9 sub-clusters. The following graph makes the comparison at the brand level (summing the shares of each brand's SKUs) and compares predicted shares of preference in the cluster with the actual shares
in the cluster, for each of the three choice tasks.
Since this is the “mid/high end Vizio cluster,” the predicted shares of preference for Vizio (in
the first three rows) are by definition much higher than for the other brands and are also nearly
equal for the three holdout tasks. However, the actual shares of Vizio in the holdout tasks (in the
last three rows) are very similar to the shares of Samsung and Sony while varying a lot more in
the three holdout tasks than in the predicted shares. So the predicted strong brand preference for
Vizio is not reflected at all in the actual holdout tasks!
The fact that the actual brand shares are quite different across the three tasks has a very likely
cause in the particular specifications of the holdout tasks:
1. In the first holdout task, Panasonic had the cheapest product in the whole task, followed
by Samsung. Both of these brands have a much higher share than predicted.
2. In the second holdout task, both Vizio and LG had the cheapest product in the whole task.
Indeed Vizio has a higher actual share than in the other tasks (but not as overwhelmingly
as in the prediction) while LG still has a low actual share (which at least confirms that
people in this cluster dislike LG).
3. In the third holdout task, Samsung had the cheapest product in the whole task and has a
much higher share than predicted. It is exactly this one concept (the cheapest one of
Samsung) that was distorting the whole aggregate MSE analysis that we discussed in the
previous section.
Less clear in the second holdout task, but very clear in the first and third holdout task, it
seems that we have a case here where the importance of brand is being overestimated while the
importance of price is being underestimated. The actual choices are much more based on price
than was predicted and much less based on brand. The earlier MSE analysis in which we found
one concept with a big discrepancy between actual and predicted aggregate share perfectly fits
into this. Please note though that this is evidence, not proof (we found a black swan but not all
swans are black).
DISCUSSION AFTER THE CONFERENCE: WHICH METRIC TO USE?
A discussion went on after the conference concerning how to evaluate and compare different
conjoint methods: is it hit rates (as has been habitual in the conjoint world throughout the years)
or is it predicted share of preference of the chosen concept? Hit rates clearly have an advantage
of not being subject to scale issues, whereas shares of preference can be manipulated through a
lower or higher scale factor in the utilities. Though you may well argue, why would anyone
manipulate shares of preference with scale factors to begin with? Predicted shares of preference
on the other hand clearly have an advantage of being a closer representation of the underlying
logistic probability modeling.
Speaking of this model, if the multinomial logistic model specification is correct (i.e., if
answers to holdout tasks were to be regarded as random draws from these utility-based
individual models), we should not even see a significant difference between the hit rate percentage and the mean share of preference. Reversing this reasoning: since these two
measures deviate so much, apparently either the model specification or the estimation procedure
contains some element of bias. It may be just the scale factor that is biased, or the problem may
be bigger than that. We would definitely recommend some further academic investigation in this
issue.
CONCLUSIONS
In general we were thrilled by some of these findings but less than enthusiastic about others.
On a positive note, it was interesting to see that removing the screening section of ACBC does
not harm its predictive strength. This can be important for studies in which shortening the length
of interview is necessary (e.g., because other methods are being used that require the respondent’s engagement, or for cost reasons). While for this test we had a customized version of ACBC available, Sawtooth announced during the conference that skipping the screening section will be available as an option to the “general public” in the next SSI Web version (v8.3).
On a similar note, using a narrower price range for ACBC did not hurt either (though it did not beat standard ACBC), despite the fact that doing so creates more multicollinearity in the design, since the total price shown to respondents is more in line with the sum of the concept’s feature prices. Across all methods, it was encouraging to find that all models beat random choice by a factor of 3–4, meaning that it is entirely possible to predict respondent choice in complex choice situations such as purchasing one television out of twenty alternatives.
One more double-edged learning is that while none of the ACBC alternatives outperformed
standard ACBC, it just goes to show that ACBC already performs well as it is! In addition since
many of the alternative legs were simplifications of the standard methodology, it shows that the
method is robust enough to support simplifications yet still yield similar results. This may very
well hint that much of the respondent’s non-compensatory choice behavior can be inferred from
their choices between near-neighbors of their BYO concept. After all, the BYO concept is for all
intents and purposes the “ideal” concept for the respondent. We had also hoped that price-balanced CBC would help to improve standard CBC, since it would force respondents to make trade-offs in areas other than just price; however, this did not turn out to be the case.
From our own internal perspective, it was quite promising to see that Dynamic CBC
outperformed both CBC alternatives, though again disappointing that it did not quite match the predictive power of the ACBC legs tested. Despite not performing as well as ACBC, Dynamic CBC could still be a viable methodology to use in cases where something like a BYO exercise or screening section might not make sense in the context of the market being analyzed. In addition, further refinement of the method could possibly lead to results as good as, if not better than, ACBC.
Finally we were surprised to see that all of the 8 tested legs were very poor in predicting one
of the concepts in one of the holdout tasks (although the ACBC legs did somewhat less poorly
than the CBC legs). The underlying phenomenon seems to be that brand sensitivity is
overestimated in all legs while price sensitivity is underestimated. This is something definitely to
dig into further, and—who knows—may eventually lead to an entirely new conjoint approach
beyond ACBC or CBC.
NEXT STEPS
As mentioned in the results section, there were some discrepancies between the actual
holdout responses and the conjoint predictions that we would like to further investigate. Given
the promise shown by our initial run of Dynamic CBC, we would like to further test more
variants of it and in other markets as well. As a means of further testing our results, we would
also like to double-check our findings concerning the increased number of cut-points on any past
data that we may have from ACBC studies that included holdout tasks.
Marco Hoogerbrugge
Jeroen Hardon
Christopher Fotenos
RESEARCH SPACE AND REALISTIC PRICING
IN SHELF LAYOUT CONJOINT (SLC)
PETER KURZ1
TNS INFRATEST
STEFAN BINNER
BMS MARKETING RESEARCH + STRATEGY
LEONHARD KEHL
PREMIUM CHOICE RESEARCH & CONSULTING
WHY TALK ABOUT SHELVES?
For consumers, times changed long ago. Instead of being served by a shop assistant, we now shop in super- and hypermarkets, which have changed the way we buy products, especially fast-moving consumer
goods (FMCGs). In most developed countries there is an overwhelming number of products for
consumers to select from:
“Traditional Trade”
“Modern Trade”
As marketers became aware of the importance of packaging design, assortment and
positioning of their products on these huge shelves, researchers developed methods to test these
new marketing mix elements. One example is a “shelf test” where respondents are interviewed in
front of a real shelf about their reaction to the offered products. (In FMCG work, the products are
often referred to as “stock keeping units” or “SKUs,” a term that emphasizes that each variation
of flavor or package size is treated as a different product.)
For a long time, conjoint analysis was not very good at mimicking such shelves in choice
tasks: early versions of CBC were limited to a small number of concepts to be shown.
Furthermore the philosophical approach for conjoint analysis, let’s call it the traditional conjoint
approach, was driven by taking products apart into attributes and levels.
However, this traditional approach missed some key elements in consumers’ choice situation
in front of a modern FMCG shelf, e.g.:
 How does the packaging design of an SKU communicate the benefits (attribute levels) of a product?
 How does an SKU perform in the complex competition with the other SKUs on the shelf?
1 Correspondence: Peter Kurz, Head of Research & Development TNS Infratest ([email protected]); Stefan Binner, Managing Director, bms marketing research + strategy ([email protected]); Leonhard Kehl, Managing Director, Premium Choice Research & Consulting ([email protected])
As it became easy for researchers to create shelf-like choice tasks (in 2013, among Sawtooth
Software users who use CBC, 11% of their CBC projects employed shelf display) a new conjoint
approach developed: “Shelf Layout Conjoint” or “SLC.”
HOW IS SHELF LAYOUT CONJOINT DIFFERENT?
The main differences between Traditional and Shelf Layout Conjoint are summarized in this
chart:
TRADITIONAL CONJOINT
- Products or concepts usually consist of defined attribute levels
- More rational or textual concept description (compared to a packaging picture)
- Almost no impact of package design
- Usually not too many concepts per task
- Many attributes—few concepts
SHELF LAYOUT CONJOINT
- Communication of “attributes” through non-varying package design (instead of levels)
- Visibility of all concepts at once
- Including impact of assortment
- Including impact of shelf position and number of facings
- Information overflow: few visible attributes (mainly product and price—picture only)—many concepts
Many approaches are used to represent a shelf in a conjoint task. Some are very simple:
Some are quite sophisticated:
However, even the most sophisticated computerized visualization does not reflect the real
situation of a consumer in a supermarket (Kurz 2008). In that paper, comparisons between a
simple grid of products from which consumers make their choices and attempts to make the
choice exercise more realistic by showing a store shelf in 3D showed no significant differences
in the resulting preference share models.
THE CHALLENGES OF SHELF LAYOUT CONJOINT
Besides differences in the visualization of the shelves, there are different objectives SLCs can
address, including:
 pricing
 product optimization
 portfolio optimization
 positioning
 layout
 promotion
SLCs also differ in the complexity of their models and experimental designs, ranging from simple main-effects models up to complex Discrete Choice Models (DCMs) with lots of attributes and parameters to be estimated.
Researchers often run into very complex models, with one attribute with a large number of levels (the SKUs) and, related to each of these levels, one attribute (often price) with a certain number of levels. Such designs can easily end up with several hundred parameters to be estimated. Furthermore, for complex experimental designs, layouts have to be generated in a special way in order to retain realistic relationships between SKUs and realistic results. So-called “alternative-specific designs” are often used in SLC, but that does not necessarily mean that it is always a good idea to estimate price effects as being alternative-specific. In terms of estimating utility values (under the assumption you estimate interaction effects, which lead to alternative-specific price effects), many different coding schemes can be prepared which are mathematically identical. But the experimental design behind the shelves is slightly different: different design strategies affect how much level overlap occurs and therefore how efficient the estimation of interactions can be. Good strategies to reduce this complexity in the estimation stage are crucial.
With Shelf Layout Conjoint now easily available to every CBC user, we would like to
encourage researchers to use this powerful tool. However, there are at least five critical questions
in the design of Shelf Layout Conjoints which need to be addressed:
 Are the research objectives suitable for Shelf Layout Conjoint?
 What is the correct target group and SKU space (the “research space”)?
 Are the planned shelf layout choice tasks meaningful for respondents, and will they
provide the desired information from their choices?
 Can we assure realistic inputs and results with regard to pricing?
 How can we build simulation models that provide reliable and meaningful results?
As this list suggests, there are many topics, problems and possible solutions with Shelf
Layout Conjoint. However, this paper focuses on only three of these very important issues. We
take a practitioner’s, rather than an academic’s point of view. The three key areas we will address
are:
1. Which are suitable research objectives?
2. How to define the research space?
3. How to handle pricing in a realistic way?
SUITABLE RESEARCH OBJECTIVES FOR SHELF LAYOUT CONJOINT
Evaluating suitable objectives requires that researchers be aware of all the limitations and
obstacles Shelf Layout Conjoint has. So, we begin by introducing three of those key limitations.
1. Visualization of the test shelf. Real shelves always look different from test shelves. Furthermore, there is naturally a difference between a 21-inch screen and a 10-meter shelf in a real supermarket. The SKUs are shown much smaller than in reality, and one cannot touch and feel them. 3D models and other approaches might help, but the basic issue still remains.
2. Realistic choices for consumers. Shelf Layout Conjoint creates an artificial distribution
and awareness: All products are on the shelf; respondents are usually asked to look at or consider
all of them.
In addition, we usually simplify the market with our test shelf. In reality every distribution
chain has different shelf offerings, which might further vary with the size of the individual store.
In the real world, consumers often leave store A and go to store B if they do not like the offering
(shop hopping). Sometimes products are out of stock, forcing consumers to buy something
different.
3. Market predictions from Shelf Layout Conjoint. Shelf Layout Conjoint provides results
from a single purchase simulation. We gain no insights about repurchase (did they like the
product at all?) or future purchase frequency.
In reality, promotions play a big role, not only in the shelf display but in other ways, for
example, with second facings. It is very challenging to measure volumetric promotion effects
such as “stocking up” purchases, but those play a big role in some product categories (Eagle
2010; Pandey, Wagner 2012).
The complexity of “razor and blade” products, where manufacturers make their profit on the refill or consumable rather than on the basic product or tool, is another example of a difficult obstacle researchers can be faced with.
SUITABLE OBJECTIVES
Despite these limitations and obstacles Shelf Layout Conjoint can provide powerful
knowledge. It is just a matter of using it to address the right objectives; if you use it
appropriately, it works very well!
Usually suitable objectives for Shelf Layout Conjoint fall in the areas of either optimization
of assortment or pricing.
The optimization of assortment most often refers to such issues as:
Line extensions with additional SKUs
 What is the impact (share of choice) of the new product?
 Where does this share of choice come from (customer migration and/or
cannibalization)?
 Which possible line extension has the highest consumer preference or leads to the
best overall result for the total brand assortment or product line?
Re-launch or substitution of existing SKUs
 What is the impact (share of choice) for the re-launch?
 Which possible re-launch alternative has the highest consumer preference?
 Which SKU should be substituted for?
 How does the result of the re-launch compare to a line extension?
Branding
 What is the effect of re-branding a line of products?
 What is the effect of the market entry of new brands or competitors?
The optimization of pricing most often involves questions like:
Price positioning
 What is the impact of different prices on share of choice and profit?
 How should the different SKUs within an assortment be priced?
 How will the market react to a competitor’s price changes?
Promotions
 What is the impact (sensitivity) to promotions?
 Which SKUs have the highest promotion effect?
 How much price reduction is necessary to create a promotion effect?
Indirect pricing
 What is the impact of different contents (i.e., package sizes) on share of choice and
profit?
 How should the contents of different SKUs within an assortment be defined?
 How will the market react to competitors’ changes in contents?
On the other hand there are research objectives which are problematic, or at least
challenging, for Shelf Layout Conjoint. Some examples include:
 Market size forecasts for a sales period
 Volumetric promotion effects
 Multi category purchase/TURF-like goals
 Positioning of products on the shelf
 Development of package design
 Evaluation of new product concepts
 New product development
Not all of the above research objectives are impossible, but they at least require very tailored
or cautious approaches.
DEFINITION OF THE CORRECT MARKET SCOPE
By the terminology “market scope” we mean the research space of the Shelf Layout
Conjoint. Market scope can be defined by three questions, which are somewhat related to each
other:
What SKUs do we show on the shelf?   => SKU Space
What consumers do we interview?      => Target Group
What do we actually ask them to do?  => Context
SKU Space
Possible solutions to the basic problem of defining the SKU space depend heavily on the specific market and product
category. Two main types of solutions are:
1. Focusing the SKU space on or by market segments
Such focus could be achieved by narrowing the SKU space to just a part of the market such as
- distribution channel (no shop has all SKUs)
- product subcategories (e.g., product types such as solid vs. liquid)
- market segments (e.g., premium or value for money)
This will provide more meaningful results for the targeted market segment. However it may
miss migration effects from (or to) other segments. Furthermore such segments might be quite
artificial from the consumer’s point of view.
Alternatively one could focus on the most relevant products only (80:20 rule).
2. Strategies to cope with too many SKUs
When there are more SKUs than can be shown on the screen or understood by the
respondents, strategies might include
- Prior SKU selection (consideration sets)
- Multiple models for subcategories
- Partial shelves
Further considerations for the SKU space:
- Private labels (which SKUs could represent this segment?)
- Out of stock effects (whether and how to include them?)
Target Group
The definition of the target group must align with the SKU space. If there is a market
segment focus, then obviously the target group should only include customers in that segment.
Conversely, if there are strategies for huge numbers of SKUs all relevant customers should be
included.
There are still other questions about the target group which should also be addressed,
including:
- Current buyers only or also non-buyers?
- Quotas on recently used brands or SKUs?
- Quotas on distribution channel?
- Quotas on the purchase occasion?
Context
Once the SKU space and the target group are defined the final element of “market scope” is
to create realistic, meaningful choice tasks:
1. Setting the scene:
- Introduction of new SKUs
- Advertising simulation
2. Visualization of the shelf
- Shelf Layout (brand blocks, multiple facings)
- Line pricing/promotions
- Possibility to enlarge or “examine” products
3. The conjoint/choice question
- What exactly is the task?
- Setting a purchase scenario or not?
- Single choice or volumetric measurement?
PRICING
Pricing is one of the most important topics in Shelf Layout Conjoint, if not the most important. In nearly all SLCs some kind of pricing issue is included as an objective. But “pricing” does not mean just one common approach. Research questions with regard to pricing differ greatly between studies. They start with easy questions about the “right price” of a single SKU. They often include the pricing of whole product portfolios, including different pack sizes, flavors and variants, and may extend to complicated objectives like determining the best promotion price and the impact of different price tags.
Before designing a Shelf Layout Conjoint researchers must therefore have a clear answer to
the question: “How can we obtain realistic input on pricing?”
Realistic pricing does not simply mean that one needs to design around the correct regular
sales price. It also requires a clear understanding of whether the following topics play a role in
the research context.
Topic 1: Market Relevant Pricing
The main issue of this topic is to investigate the context in which the pricing scenario takes
place. Usually such an investigation starts with the determination of actual sales prices. At first
glance, this seems very easy and not worth a lot of time. However, most products are not priced
with a single regular sales price. For example, there are different prices in different sales
channels or store brands. Most products have many different actual sales prices. Therefore one
must start with a closer look at scanner data or list prices of the products in the SKU space.
As a next step, one has to get a clear understanding of the environment in which the product
is sold. Are there different channels like hypermarkets, supermarkets, traders, etc. that have to be
taken into account? In the real world, prices are often too different across channels to be used in
only one Shelf Layout Conjoint design. So we often end up with different conjoint models for
the different channels.
Furthermore, the different store brands may play a role. Store brand A might have a different
pricing strategy because it competes with a different set of SKUs than store brand B. How
relevant are the different private labels or white-label/generic products in the researched market?
In consequence one often ends up with more than one Shelf Layout Conjoint model (perhaps
even dozens of models) for one “simple” pricing context. In such a situation, researchers have to
decide whether to simulate each model independently or to build up a more complex simulator.
This will allow pricing simulations on an overall market level, tying together the large number of
choice models, to construct a realistic “playground” for market simulations.
Topic 2: Initial Price Position of New SKUs
With the simulated launch of new products one has to make prior assumptions about their
pricing before the survey is fielded. Thus, one of the important tasks for the researcher and her
client is to define reasonable price points for the new products in the model. The price range
must be as wide as necessary, but as narrow as possible.
Topic 3: Definition of Price Range Widths and Steps
Shelf Layout Conjoint should cover all possible pricing scenarios that would be interesting
for the client. However, respondents should not be confronted with unrealistically high or low
prices. Such extremes might artificially influence the choices of the respondent and might have
an influence on the measured price elasticity. Unrealistically high price elasticity is usually
caused by too wide a price range, with extremely cheap or extremely expensive prices. One
should be aware that the price range over which an SKU is studied has a direct impact on its
elasticity results! This is not only true for new products, where respondents have no real price knowledge, but also for existing products. Furthermore, unrealistically low or high price points can result in less attention and more fatigue in respondents’ answers than realistic price changes would have caused.
Topic 4: Assortment Pricing (Line Pricing)
Many clients have not just a single product in the market, but a complete line of competing
products on the same shelf. In such cases it is often important to price the products in relation to
each other.
A specific issue in this regard is line pricing: several products of one supplier share the same
price, but differ in their contents (package sizes) or other characteristics. Many researchers
measure the utility of prices independently for each SKU and create line pricing only in the
simulation stage. However, in this situation, it is essential to use line-priced choice tasks in the
interview: respondents’ preference structure can be very different when seeing the same prices
for all products of one manufacturer rather than seeing different prices, which often results in
choosing the least expensive product. This leads to overestimation of preference shares for
cheaper products.
A similar effect can be observed if the relative price separations of products are not respected
in the choice tasks. For example: if one always sells orange juice for 50 cents more than water,
this relative price distance is known or learned by consumers and taken into account when they
state their preference in the choice tasks.
Special pricing designs such as line pricing can be constructed by exporting the standard
design created with Sawtooth Software’s CBC into a CSV format and reworking it in Excel.
However, one must test the manipulated designs afterwards in order to ensure the prior design
criteria are still met. This is done by re-importing the modified design into Sawtooth Software’s
CBC and running efficiency tests.
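As an illustration of the rework step, here is a minimal sketch in Python, assuming a hypothetical exported design file with columns Version, Task, SKU and PriceLevel (the actual CBC export layout may differ); it forces all SKUs of one supplier onto a shared price level within each task:

```python
import pandas as pd

# Hypothetical column names; the real exported CBC design file may differ.
design = pd.read_csv("cbc_design_export.csv")

# SKUs that belong to one supplier and must share a price level (line pricing).
LINE_PRICED_SKUS = {3, 4, 5}

def enforce_line_pricing(df):
    df = df.copy()
    for (version, task), rows in df.groupby(["Version", "Task"]):
        in_line = rows["SKU"].isin(LINE_PRICED_SKUS)
        if in_line.any():
            # Use the price level of the first line-priced SKU for all of them.
            shared_level = rows.loc[in_line, "PriceLevel"].iloc[0]
            df.loc[rows.index[in_line], "PriceLevel"] = shared_level
    return df

enforce_line_pricing(design).to_csv("cbc_design_line_priced.csv", index=False)
# The modified file is then re-imported and checked with efficiency tests.
```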
Topic 5: Indirect Pricing
In markets where most brands use line pricing, the real individual price positioning of SKUs is often achieved through variation in their package content sizes. Package content can be varied and modeled in the same way, and with the same limitations, as monetary prices. However, one
must ensure that the content information is sufficiently visible to the consumers (e.g., written on
price tags or on the product images).
Topic 6: Price Tags
Traditionally, prices in conjoint are treated like an attribute level and are simply displayed
beneath each concept. Therefore in many Shelf Layout Conjoint projects the price tag is simply
the representation of the actual market price and the selected range around it. However, in reality
consumers see the product name, content size, number of applications, price per application or
standard unit in addition to the purchase price. (In the European Community, such information is
mandatory by law; in many other places, it is at least customary if not required.)
In choice tasks, many respondents therefore search not only for the purchase price, but also
for additional information about the SKUs in their relevant choice set. Oversimplification of
price tags in Shelf Layout Conjoint does not sufficiently reflect the real decision process.
Therefore, it is essential to include the usual additional information to ensure realistic choice
tasks for respondents.
Topic 7: Promotions
The subject of promotions in Shelf Layout Conjoint is often discussed and remains controversial. In
our opinion, only some effects of promotions can be measured and modeled in Shelf Layout
Conjoint. SLC provides a one-point-in-time measurement of consumer preference. Thus,
promotion effects which require information over a time period of consumer choices cannot be
measured with SLC. It is essential to keep in mind that we can neither answer the question whether a
promotion campaign results in higher sales volume for the client nor make assumptions about
market expansion—we simply do not know anything about the purchase behavior (what and how
much) in the future period.
However, SLC can simulate customers’ reaction to different promotion activities. This
includes the simulation of the necessary price discount in order to achieve a promotion effect,
comparison of the effectiveness of different promotion types (e.g., buy two, get one free) as well
as simulation of competitive reactions, but only at a single point in time. In order to analyze such
promotion effects with high accuracy, we recommend applying different attributes and levels for
the promotional offers from those for the usual market prices. SLCs that include promotion effects therefore often have two sets of price parameters, one for the regular market price and one for the
promotional price.
Topic 8: Price Elasticity
Price Elasticity is a coefficient which tells us how sales volume changes when prices are
changed. However, one cannot predict sales figures from SLC. What we get is “share of
preference” or “share of choice” and we know whether more or fewer people are probably
purchasing when prices change.
In categories with single-unit purchase cycles, this is not much of a problem, but in the case
of Fast Moving Consumer Goods (FMCG) with large shelves where consumers often buy more
than one unit—especially under promotion—it is very critical to be precise when talking about
price elasticity. We recommend speaking carefully of a “price to share of preference coefficient”
unless sales figures are used in addition to derive true price elasticity.
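One way to make this concrete (an illustration, not a formula from the paper): the share-based coefficient can be computed like an elasticity, only with share of preference in place of sales volume,
\[ \epsilon_{\text{SoP}} \;=\; \frac{\Delta \text{SoP} / \text{SoP}}{\Delta P / P}, \]
where P is the SKU’s price and SoP its simulated share of preference.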
The number of SKUs included in the research has a strong impact on the “price to share of
preference coefficient.” The fewer SKUs one includes in the model, the higher the coefficient; many researchers observe high “ratios” that are due only to the research design. Such figures are certainly wrong if the client wants to know the true “coefficient of elasticity” based on sales figures.
Topic 9: Complexity of the Model
SLC models are normally far more complex than the usual CBC/DCM models. The basic
structure of SLC is usually one many-leveled SKU attribute and for each (or most) of its levels,
one price attribute. Sometimes there are additional attributes for each SKU such as promotion or
content. As a consequence there are often too many parameters to be estimated in HB.
Statistically, we have “over-parameterization” of the model. However there are approaches to
reduce the number of estimated parameters, e.g.:
 Do we need part-worth estimates for each price point?
 Could we use a linear price function?
 Do we really need price variation for all SKUs?
 Could we use fixed price points for some competitors’ SKUs?
 Could we model different price effects by price tiers (such as low, mid, high) instead of one price attribute per SKU?
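As a back-of-the-envelope illustration of how the simplifications above shrink the estimation problem, the following sketch counts parameters for a hypothetical shelf with 40 SKUs and 5 price points per SKU (the numbers are invented, not from the paper):

```python
# Rough parameter counts for a hypothetical SLC with 40 SKUs and 5 price points each.
n_sku, n_price = 40, 5

partworth_per_sku = (n_sku - 1) + n_sku * (n_price - 1)  # SKU effects + part-worth price per SKU
linear_per_sku    = (n_sku - 1) + n_sku                   # SKU effects + one linear slope per SKU
linear_per_tier   = (n_sku - 1) + 3                       # SKU effects + one slope per tier (low/mid/high)

print(partworth_per_sku, linear_per_sku, linear_per_tier)  # 199, 79, 42
```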
Depending on the quantity of information one can obtain from a single respondent, it may be
better to use aggregate models than HB models. The question is, how many tasks could one
really ask of a single respondent before reaching her individual choice task threshold, and how
many concepts could be displayed on one screen (Kurz, Binner 2012)? If it’s not possible to
show respondents a large enough number of choice tasks to get good individual information,
relative to the large number of parameters in an SLC model, HB utility estimates will fail to
capture much heterogeneity anyway.
TOPICS BEYOND THIS PAPER
How can researchers further ensure that Shelf Layout Conjoint provides reliable and
meaningful results? Here are some additional topics:
 Sample size and number of tasks
 Block designs
 Static SKUs
 Maximum number of SKUs on shelf
 Choice task thresholds
 Bridged models
 Usage of different (more informative) priors in HB to obtain better estimates
EIGHT KEY TAKE-AWAYS FOR SLC
1. Be aware of its limitations when considering Shelf Layout Conjoint as a methodology for
your customers’ objectives. One cannot address every research question.
2. Try hard to ensure that your pricing accurately reflects the market reality. If one model is
not possible, use multi-model simulations or single market segments.
3. Be aware that the price range definition for a SKU has a direct impact on its elasticity
results and define realistic price ranges with care.
4. Adapt your design to the market reality (e.g., line pricing), starting with the choice tasks
and not only in your simulations.
5. Do not oversimplify price tags in Shelf Layout Conjoint; be sure to sufficiently reflect the
real decision environment.
6. SLC provides just a one-point-in-time measurement of consumer preference. Promotion effects that require information about consumer choices over a time period cannot be measured.
7. Price elasticity derived from SLC is better called “price to share of preference
coefficients.”
8. SLC often suffers from “over-parameterization” within the model. One should evaluate
different approaches to reduce the number of estimated parameters.
Peter Kurz
Stefan Binner
Leonhard Kehl
REFERENCES
Eagle, Tom (2010): Modeling Demand Using Simple Methods: Joint Discrete/Continuous
Modeling; 2010 Sawtooth Software Conference Proceedings.
Kehl, Leonhard; Foscht, Thomas; Schloffer, Judith (2010): Conjoint Design and the impact
on Price-Elasticity and Validity; Sawtooth Europe Conference Cologne.
Kurz, Peter; Binner, Stefan (2012): “The Individual Choice Task Threshold” Need for Variable
Number of Choice Tasks; 2012 Sawtooth Software Conference Proceedings.
Kurz, Peter (2008): A comparison between Discrete Choice Models based on virtual shelves
and flat shelf layouts; SKIM Working Towards Symbioses Conference Barcelona.
Orme, Bryan (2003): Special Features of CBC Software for Packaged Goods and Beverage
Research; Sawtooth Software, via Website.
Pandey, Rohit; Wagner, John; Knappenberger, Robyn (2012): Building Expandable Consumption into a Share-Only MNL Model; 2012 Sawtooth Software Conference Proceedings.
ATTRIBUTE NON-ATTENDANCE IN DISCRETE CHOICE EXPERIMENTS
DAN YARDLEY
MARITZ RESEARCH
EXECUTIVE SUMMARY
Some respondents ignore certain attributes in choice experiments to help them choose
between competing alternatives. By asking the respondents which attributes they ignored and
accounting for this attribute non-attendance we hope to improve our preference models. We test
different ways of asking stated non-attendance and the impact of non-attendance on partial
profile designs. We also explore an empirical method of identifying and accounting for attribute
non-attendance.
We found that accounting for stated attribute non-attendance does not improve our models.
Identifying and accounting for non-attendance empirically can result in better predictions on
holdouts.
BACKGROUND
Recent research literature has included discussions of “attribute non-attendance.” In the case
of stated non-attendance (SNA) we ask respondents after they answer their choice questions,
which attributes, if any, they ignored when they made their choices. This additional information
can then be used to zero-out the effect of the ignored attributes. Taking SNA into account
theoretically improves model fit.
We seek to summarize this literature, to replicate findings using data from two recent choice
experiments, and to test whether taking SNA into account improves predictions of in-sample and
out-of-sample holdout choices. We also explore different methods of incorporating SNA into our
preference model and different ways of asking SNA questions. In addition to asking different
stated non-attendance questions, we will also compare SNA to stated attribute level ratings (two
different scales tested) and self-explicated importance allocations.
Though it’s controversial, some researchers have used latent class analysis to identify non-attendance analytically (Hensher and Greene, 2010). We will attempt to identify non-attendance by using HB analysis and other methods. We will determine if accounting for “derived non-attendance” improves aggregate model fit and holdout predictions. We will compare “derived
non-attendance” to the different methods of asking SNA and stated importance.
It’s likely that non-attendance is more of an issue for full profile than for partial profile
experiments, another hypothesis our research design will allow us to test. By varying the
attributes shown, we would expect respondents to pay closer attention to which attributes are
presented, and thus ignore fewer attributes.
STUDY 1
We first look at the data from a tablet computer study conducted in May 2012. Respondents
were tablet computer owners or intenders. 502 respondents saw 18 full profile choice tasks with
3 alternatives and 8 attributes. The attributes and levels are as follows:
Attribute | Level 1 | Level 2 | Level 3
Operating System | Apple | Android | Windows
Memory | 8 GB | 16 GB | 64 GB
Included Cloud Storage | None | 5 GB | 50 GB
Price | $199 | $499 | $799
Camera Picture Quality | 0.3 Megapixels | 2 Megapixels | 5 Megapixels
Warranty | 3 Months | 1 Year | 3 Years
Screen Size | 5" | 7" | 10"
Screen Resolution | High definition display (200 pixels per inch) | Extra-high definition display (300 pixels per inch) |
Another group of 501 respondents saw 16 partial profile tasks based upon the same attributes
and levels as above. They saw 4 attributes at a time. 300 more respondents completed a
differently formatted set of partial profile tasks and were designated for use as an out-of-sample
holdout. All respondents saw the same set of 6 full profile holdout tasks and a stated non-attendance question. The stated non-attendance question and responses were:
Please indicate which of the attributes, if any, you ignored when you made your choices in
the preceding questions:
Attribute | Full Profile % Ignored | Partial Profile % Ignored | Full Profile Rank | Partial Profile Rank
Memory | 17.5% | 14.2% | 8 | 7
Operating System | 20.9% | 18.6% | 7 | 4
Screen Resolution | 22.7% | 14.0% | 6 | 8
Price | 26.9% | 16.4% | 5 | 6
Screen Size | 27.3% | 17.2% | 4 | 5
Warranty | 27.9% | 24.2% | 3 | 2
Camera Picture Quality | 30.1% | 21.8% | 2 | 3
Included Cloud Storage | 36.7% | 28.9% | 1 | 1
I did not ignore any of these | 26.1% | 36.3% | |
Average # Ignored | 2.10 | 1.55 | |
We can see that the respondents who saw partial profile choice tasks answered the stated non-attendance question differently from those who saw full profile. The attribute rankings are different for partial and full profile, and 6 of the 9 attribute frequencies are significantly different. Significantly more partial profile respondents stated that they did not ignore any of the
attributes. From these differences we conclude that partial profile respondents pay closer
attention to which attributes are showing. This is due to the fact that the attributes shown vary
from task to task.
We now compare aggregate (pooled) multinomial logistic models created using data
augmented with the stated non-attendance (SNA) question and models with no SNA. The way
we account for the SNA question is: if the respondent said they ignored the attribute, we zero out
the attribute in the design, effectively removing the attribute for that respondent. Accounting for
SNA in this manner yields mixed results.
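A minimal sketch of this zeroing-out step (with invented dimensions and column indices, not the study’s actual coding):

```python
import numpy as np

def zero_out_ignored(X, ignored_mask, attr_cols):
    """Zero out design-matrix columns for attributes a respondent said they ignored.

    X            : (n_rows, n_params) coded design for one respondent
    ignored_mask : boolean array, one entry per attribute (True = stated as ignored)
    attr_cols    : list mapping each attribute to the column indices it occupies in X
    """
    X = X.copy()
    for attr, cols in enumerate(attr_cols):
        if ignored_mask[attr]:
            X[:, cols] = 0.0  # the ignored attribute no longer contributes to utility
    return X

# Hypothetical respondent: 8 attributes, 2 coded columns each, attribute 3 ignored.
X = np.random.randn(54, 16)                      # 18 tasks x 3 alternatives
attr_cols = [[2 * i, 2 * i + 1] for i in range(8)]
ignored = np.zeros(8, dtype=bool); ignored[3] = True
X_sna = zero_out_ignored(X, ignored, attr_cols)
```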
Holdout Tasks | Full Profile | Partial
No SNA | 56.7% | 53.1%
With SNA | 56.9% | 51.6%

Likelihood Ratio | Full Profile | Partial
No SNA | 1768 | 2259
With SNA | 2080 | 2197

Out of Sample | Full Profile | Partial
No SNA | 47.7% | 55.4%
With SNA | 52.0% | 55.4%
For the full profile respondents, we see slight improvement in Holdout Tasks hit rates from
56.7% to 56.9% (applying the aggregate logit parameters to predict individual holdout choices).
Out-of-sample hit rates (applying the aggregate logit parameters from the training set to each
out-of-sample respondent’s choices) and Likelihood Ratio (2 times the difference in the model
LL compared to the null LL) also show improvement. For partial profile respondents, accounting
for the SNA did not lead to better results. It should be noted that partial profile performed better
on out-of-sample hit rates. This, in part, is due to the fact that the out-of-sample tasks are partial
profile. Similarly, the full profile respondents performed better on holdout task hit rates due to
the holdout tasks being full profile tasks.
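For reference, the likelihood ratio reported above is simply
\[ \text{LR} = 2\,(LL_{\text{model}} - LL_{\text{null}}), \]
i.e., twice the improvement in log-likelihood over the null model.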
Looking at the resulting parameter estimates we see little to no differences between models
that account for SNA (dashed lines) and those that don’t. We do, however, see slight differences
between partial profile and full profile.
(Chart: aggregate logit parameter estimates for each attribute level—Operating System, Memory, Cloud Storage, Price, Screen Resolution, Camera (Mpx), Warranty and Screen Size—comparing the FP, FP SNA, PP and PP SNA models.)
Shifting from aggregate models to Hierarchical Bayes, again we see that accounting for SNA
does not lead to improved models. In addition to accounting for SNA by zeroing out the ignored
attributes in the design (Pre), we also look at zeroing out the resulting HB parameter estimates
(Post). Zeroing out HB parameters (Post) has a negative impact on holdout tasks. Not accounting
for SNA resulted in better holdout task hit rates and root likelihood (RLH).
Holdout Tasks | Full Profile | Partial
No SNA | 72.8% | 63.2%
Pre SNA | 71.1% | 61.1%
Post SNA | 66.6% | 59.4%

RLH | Full Profile | Partial
No SNA | 0.701 | 0.622
With SNA | 0.633 | 0.565
Lots of information is obtained by asking respondents multiple choice tasks. With all this
information at our fingertips, is it really necessary to ask additional questions about attribute
attendance? Perhaps we can derive attribute attendance empirically, and improve our models as
well. One simple method of calculating attribute attendance that we tested is to compare each
attribute’s utility range from the Hierarchical Bayes models to the attribute with the highest
utility range. Below are the utility ranges for the first three full profile respondents from our data.
Utility Ranges
ID | Operating System | Memory | Cloud Storage | Price | Screen Resolution | Camera Megapixels | Warranty | Screen Size
6 | 2.73 | 1.89 | 1.52 | 1.82 | 1.01 | 2.65 | 0.84 | 1.13
8 | 0.14 | 0.64 | 0.04 | 9.87 | 0.02 | 0.68 | 0.38 | 0.51
12 | 3.07 | 0.64 | 0.16 | 1.36 | 0.12 | 1.60 | 0.91 | 0.94
With a utility range for Price of 9.87 and all other attribute ranges of less than 1, we can
safely say that the respondent with ID 8 did not ignore Price. The question becomes, at what
point are attributes being ignored? We analyze the data at various cut points. For each cut point
we find the attribute for each individual with the largest range and then assume everything below
the cut point is ignored by the individual. For example, if the utility range of Price is the largest
at 9.87, a 10% cut point drops all attributes with a utility range of .98 and smaller. Below we
have identified for our three respondents (grayed out) the attributes they empirically ignored at
the 10% cut point. At this cut point we would say the respondent ID 6 ignored none of the
attributes, ID 8 ignored 7, and ID 12 ignored 2 of the 8 attributes.
Utility Ranges, 10% Cut Point (* = empirically ignored)
ID | Operating System | Memory | Cloud Storage | Price | Screen Resolution | Camera Megapixels | Warranty | Screen Size
6 | 2.73 | 1.89 | 1.52 | 1.82 | 1.01 | 2.65 | 0.84 | 1.13
8 | 0.14* | 0.64* | 0.04* | 9.87 | 0.02* | 0.68* | 0.38* | 0.51*
12 | 3.07 | 0.64 | 0.16* | 1.36 | 0.12* | 1.60 | 0.91 | 0.94
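A minimal sketch of this cut-point rule, assuming the per-attribute utility ranges have already been computed from the HB estimates:

```python
import numpy as np

def empirically_ignored(utility_ranges, cut=0.10):
    """Flag attributes treated as 'ignored' under the cut-point rule above.

    utility_ranges : 1-D array of HB utility ranges, one per attribute
    cut            : fraction of the respondent's largest range below which an
                     attribute is treated as ignored (e.g., 0.10 for a 10% cut point)
    Returns a boolean array (True = treated as ignored).
    """
    ranges = np.asarray(utility_ranges, dtype=float)
    return ranges < cut * ranges.max()

# Respondent ID 8 from the table above
id8 = [0.14, 0.64, 0.04, 9.87, 0.02, 0.68, 0.38, 0.51]
print(empirically_ignored(id8, cut=0.10).sum())  # 7 attributes flagged as ignored
```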
We now analyze different cut points to see if we can improve the fit and predictive ability of
our models, and to find an optimal cut point.
(Chart: “Empirical Attribute Non Attendance”—average hit rate versus the cutting point below the maximum utility range (None to 50%), plotted for Internal - FP, Internal - PP, Hold Out - FP and Hold Out - PP.)
We see that as the cut point increases so does the hit rate on the holdout tasks. An optimal cut
point is not obvious. We will use a 10% cut point as a conservative look and 45% as an
aggressive cut point. Looking further at the empirical 10% and 45% cut points in the table below,
we see that for full profile we get an improved mean absolute error (MAE). For partial profile the
MAE stays about the same. Empirically accounting for attribute non-attendance improved our
models while accounting for stated non-attendance did not.
MAE | Full Profile | Partial
No SNA | 0.570 | 1.082
SNA | 0.686 | 1.109
Emp 10 | 0.568 | 1.082
Emp 45 | 0.534 | 1.089
From this first study we see that respondents pay more attention to partial profile type
designs, and we conclude that accounting for stated non-attendance does not improve our models.
We wanted to further explore these findings and test our empirical methods so we conducted a
second study.
STUDY 2
The second study we conducted was a web-based survey fielded in March 2013. The topic of
this study was “significant others” and included respondents that were interested in finding a
spouse, partner, or significant other. We asked 2,000 respondents 12 choice tasks with 8
attributes and 5 alternatives. We also asked all respondents the same 3 holdout tasks and a 100
point attribute importance allocation question. Respondents were randomly assigned to 1 of 4
cells of a 2x2 design. The 2x2 design was comprised of 2 different stated non-attendance
questions, and 2 desirability scale rating questions. The following table shows the attributes and
levels considered for this study.
Attribute | Levels
Attractiveness | Not Very Attractive; Somewhat Attractive; Very Attractive
Romantic/Passionate | Not Very Romantic/Passionate; Somewhat Romantic/Passionate; Very Romantic/Passionate
Honesty/Loyalty | Can't Trust; Mostly Trust; Completely Trust
Funny | Not Funny; Sometimes Funny; Very Funny
Intelligence | Not Very Smart; Pretty Smart; Brilliant
Political Views | Strong Republican; Swing Voter; Strong Democrat
Religious Views | Christian; Religious - Not Christian; No Religion/Secular
Annual Income | $15,000; $40,000; $65,000; $100,000; $200,000
We wanted to test different ways of asking attribute non-attendance. Like our first study, we
asked half the respondents which attributes they ignored during the choice tasks: “Which of these
attributes, if any, did you ignore in making your choice about a significant other?” The other half
of respondents were asked which attributes they used: “Which of these attributes did you use in
making your choice about a significant other?”
For each attribute in the choice study, we asked each respondent to rate all levels of the
attribute on desirability. Half of the respondents were asked a 1–5 Scale the other half a 0–10
Scale. Below are examples of these rating questions for the Attractiveness attribute.
1–5 Scale
Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness?
Response scale: Completely Unacceptable (1), Not Very Desirable (2), Desirable (3), Highly Desirable (4), Extremely Desirable (5), No Opinion/Not Relevant
Levels rated: Not Very Attractive, Somewhat Attractive, Very Attractive
0–10 Scale
Still thinking about a possible significant other, how desirable is each of the following levels of their Attractiveness?
Response scale: Extremely Undesirable (0) through Extremely Desirable (10), plus No Opinion/Not Relevant
Levels rated: Not Very Attractive, Somewhat Attractive, Very Attractive
As previously mentioned, all respondents were asked an attribute importance allocation
question. This question was asked of all respondents so that we could use it as a base measure for
comparisons. The question asked: “For each of the attributes shown, how important is it for your
significant other to have the more desirable versus less desirable level? Please allocate 100 points
among the attributes, with the most important attributes having the most points. If an attribute
has no importance to you at all, it should be assigned an importance of 0.”
We now compare the results for the different non-attendance methods. All methods easily
identify Honesty/Loyalty as the most influential attribute. 86.8% of respondents asked if they
used Honesty/Loyalty in making their choice marked they did. Annual Income is the least
influential attribute for the stated questions, and second-to-least for the allocation question. Since
Choice, 1–5 Scale and 0–10 Scale are derived, they don’t show this same social bias, and Annual
Income ranks differently. Choice looks at the average range of the attribute’s HB coefficients.
Similarly, the 1–5 and 0–10 Scales look at the average range of the rated levels. For the 0–10
Scale, the average range between “Can’t Trust” and “Completely Trust” is 8.7.
Attribute | Stated Used | Stated Ignored | 1-5 Scale | 0-10 Scale | Allocation | Choice
Honesty/Loyalty | 86.8% | 8.1% | 3.44 | 8.70 | 31.3% | 5.93
Funny | 53.7% | 17.4% | 2.11 | 5.42 | 10.0% | 1.20
Intelligence | 52.1% | 17.0% | 1.81 | 4.75 | 10.6% | 1.24
Romantic/Passionate | 48.7% | 15.4% | 2.06 | 5.12 | 12.5% | 0.96
Attractiveness | 44.7% | 25.6% | 1.81 | 4.18 | 13.8% | 1.44
Religious Views | 39.3% | 26.7% | 0.88 | 2.12 | 9.7% | 0.82
Political Views | 28.0% | 35.9% | 0.33 | 1.11 | 5.3% | 0.50
Annual Income | 20.1% | 44.2% | 1.88 | 4.22 | 6.8% | 1.63

Attribute Rankings | Stated Used | Stated Ignored | 1-5 Scale | 0-10 Scale | Allocation | Choice
Honesty/Loyalty | 1 | 1 | 1 | 1 | 1 | 1
Funny | 2 | 4 | 2 | 2 | 5 | 5
Intelligence | 3 | 3 | 5 | 4 | 4 | 4
Romantic/Passionate | 4 | 2 | 3 | 3 | 3 | 6
Attractiveness | 5 | 5 | 6 | 6 | 2 | 3
Religious Views | 6 | 6 | 7 | 7 | 6 | 7
Political Views | 7 | 7 | 8 | 8 | 8 | 8
Annual Income | 8 | 8 | 4 | 5 | 7 | 2
In addition to social bias, another deficiency with Stated Used and Stated Ignored questions
is that some respondents don’t check all the boxes they should, thus understating usage of
attributes for Stated Used, and overstating usage for Stated Ignored. The respondents who were asked which attributes they used selected on average 3.72 attributes. The average number of
attributes that respondents said they ignored is 1.92, implying they used 6.08 of the 8 attributes.
If respondents carefully considered these questions, and truthfully marked all that applied, the
difference between Stated Used and Stated Ignored would be smaller.
For each type of stated non-attendance question, we analyzed each attribute for all
respondents and determined if the attribute was ignored or not. The following table shows the
differences across the methods of identifying attribute non-attendance. For the allocation
question, if the respondent allocated 0 of the 100 points, the attribute was considered ignored.
For the 1–5 Scale and 0–10 Scale, if the range of the attribute’s level ratings was 0, the attribute was
considered ignored. The table below shows the percent discordance between methods. For example,
the Allocate method classified 35.1% of respondent-attribute combinations differently from the
Stated Used method. Overall, the methods differ substantially from one another. The most divergent
methods are the Empirical 10% and Stated Used, with 47.6% discordance.
% Discordance    Stated Used   Stated Ignore   1-5 Scale   0-10 Scale   Emp 10%
Allocate         35.1%         24.9%           22.3%       21.3%        28.8%
Stated Used      -             -               39.3%       41.6%        47.6%
Stated Ignore    -             -               23.2%       22.2%        27.9%
1-5 Scale        -             -               -           23.6%        22.4%
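To make these comparisons concrete, a minimal sketch of how such discordance figures can be computed from respondent-by-attribute “ignored” flags is shown below. The array names and the two illustrative rules (zero allocation, zero rating range) are our own shorthand, not the authors’ code.

    import numpy as np

    # Hypothetical inputs: one row per respondent, one column per attribute.
    # allocation[r, a] = points (out of 100) respondent r gave attribute a
    # ratings[r, a, l] = desirability rating respondent r gave level l of attribute a
    allocation = np.random.default_rng(1).integers(0, 30, size=(500, 8))
    ratings = np.random.default_rng(2).integers(1, 6, size=(500, 8, 3))

    # An attribute counts as "ignored" if it received 0 allocation points ...
    ignored_alloc = allocation == 0
    # ... or if the range of its level ratings is 0 (same rating for every level).
    ignored_scale = (ratings.max(axis=2) - ratings.min(axis=2)) == 0

    # Percent discordance: share of respondent-attribute cells the two rules classify differently.
    discordance = (ignored_alloc != ignored_scale).mean() * 100
    print(f"Discordance between Allocate and rating-scale rules: {discordance:.1f}%")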
Accounting for these diverse methods of attribute non-attendance in our models, we can see
the impact on holdout hit rates is small, and only the 1–5 Scale shows improvement over the
model without stated non-attendance.
Hit Rates    No SNA   Allocate   Stated Used   Stated Ignore   1-5 Scale   0-10 Scale
Holdout 1    55.6%    55.2%      51.4%         56.4%           58.8%       52.8%
Holdout 2    57.1%    58.1%      54.1%         54.9%           57.0%       56.5%
Holdout 3    61.8%    61.3%      57.6%         62.1%           61.5%       61.2%
Average      58.2%    58.2%      54.3%         57.8%           59.1%       56.8%
When we account for attribute non-attendance empirically, using the method previously
described, we see improvement in the holdouts. Again we see, in the table below, that as we
increase the cut point, zeroing out more attributes, the holdouts increase. Instead of accounting
for attribute non-attendance by asking additional questions we can efficiently do so empirically
using the choice data.
Hit Rates
Cut Point    None    10%     20%     30%     40%     50%
Holdout      58.2%   58.7%   59.0%   62.4%   64.8%   67.0%
Internal     91.0%   90.3%   86.5%   81.6%   76.8%   73.5%
CONCLUSIONS
When asked “Please indicate which of the attributes, if any, you ignored when you made
your choices in the preceding questions,” respondents who had previously answered discrete choice
questions with a partial-profile design indicated that they ignored fewer attributes than those asked
full-profile choice questions. Partial-profile designs solicit more attentive responses than full-profile
designs. This, we believe, is because the attributes shown change from task to task, demanding
more of the respondent’s attention.
Aggregate and hierarchical Bayes models typically do not perform better when we account
for stated attribute non-attendance. Among the direct questions, accounting for which attributes
were ignored performs better than accounting for which attributes were used. Derived methods eliminate
the social bias of direct questions and, when accounted for in the models, tend to perform better
than the direct questions. Respondents use a different thought process when answering stated
attribute non-attendance questions than when completing choice tasks. Combining the different question types pollutes
the interpretation of the models and is discouraged.
A simple empirical way to account for attribute non-attendance is to look at the range of each
attribute’s HB utilities and zero out attributes with relatively small ranges. Models where we identify
non-attendance empirically perform better on holdouts and benefit from not needing additional
questions.
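As an illustration of this empirical rule, the sketch below flags an attribute as non-attended when the range of its part-worths is small relative to the respondent’s largest attribute range; the exact thresholding rule, array layout and names here are our assumptions, with the cut parameter standing in for the 10% to 50% cut points reported above.

    import numpy as np

    def empirical_non_attendance(partworths, level_index, cut=0.10):
        """Flag attributes whose utility range is small relative to the
        respondent's largest attribute range (an assumed reading of the rule).

        partworths  : (n_respondents, n_levels) array of HB part-worth utilities
        level_index : list mapping each attribute to its column indices
        cut         : cut point (e.g., 0.10 for the "10%" column above)
        """
        ranges = np.column_stack([
            partworths[:, cols].max(axis=1) - partworths[:, cols].min(axis=1)
            for cols in level_index
        ])                                    # (n_respondents, n_attributes)
        threshold = cut * ranges.max(axis=1, keepdims=True)
        ignored = ranges < threshold          # True -> zero out this attribute's utilities
        return ignored

    # Hypothetical example: 3 attributes with 3, 2 and 4 levels respectively.
    rng = np.random.default_rng(0)
    betas = rng.normal(size=(500, 9))
    flags = empirical_non_attendance(betas, [[0, 1, 2], [3, 4], [5, 6, 7, 8]], cut=0.10)
    print(flags.mean(axis=0))   # share of respondents flagged per attribute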
Dan Yardley
ANCHORED ADAPTIVE MAXDIFF:
APPLICATION IN CONTINUOUS CONCEPT TEST
ROSANNA MAU
JANE TANG
LEANN HELMRICH
MAGGIE COURNOYER
VISION CRITICAL
SUMMARY
Innovative firms with a large number of potential new products often set up continuous
programs to test these concepts in waves as they are developed. The test program usually
assesses these concepts using monadic or sequential monadic ratings. It is important that the
results be comparable not just within each wave but across waves as well. The results of all the
testing are used to build a normative database and select the best ideas for implementation.
MaxDiff is superior to ratings, but is not well suited for tracking across the multiple waves of
a continuous testing program. This can be addressed by using an Anchored Adaptive MaxDiff
approach. The use of anchoring transforms relative preferences into an absolute scale, which is
comparable across waves. Our results show that while there are strong consistencies between the
two methodologies, the concepts are more clearly differentiated through their anchored MaxDiff
scores. Concepts that were later proven to be successes also seemed to be more clearly identified
using the Anchored approach. It is time to bring MaxDiff into the area of continuous concept
testing.
1. INTRODUCTION
Concept testing is one of the most commonly used tools for new product development. Firms
with a large number of ideas to test usually have one or more continuous concept test programs.
Rather than testing a large number of concepts in one go, a small group of concepts are
developed and tested at regular intervals. Traditional monadic or sequential monadic concept
testing methodology—based on rating scales—is well suited for this type of program.
Respondents rate each concept one at a time on an absolute scale, and the results can be
compared within each wave and across waves as well. Over time, the testing program builds up a
normative database that is used to identify the best candidates for the next stage of development.
To ensure that results are truly comparable across waves, the approach used in all of the
waves must be consistent. Some of the most important components that must be monitored
include:
Study design—A sequential monadic set up is often used for this type of program. Each
respondent should be exposed to a fixed number of concepts in each wave. The number of
respondents seeing each concept should also be approximately the same.
Sample design and qualification—The sample specification and qualifying criteria should be
consistent between waves. The source of sample should also remain stable. River samples and
router samples suffer from lack of control over sample composition and therefore are not suitable
for this purpose. Network samples, where the sample is made up of several different panel
suppliers, need to be controlled so that similar proportions come from each panel supplier each
time.
Number and format of the concept tested—The number of concepts tested should be similar
between waves. If a really large number of items need to be tested, new waves should be added.
The concepts should be at about the same stage of concept development. The format of the
concepts, for example, image with a text description, should also remain consistent.
Questionnaire design and reporting—In a sequential monadic set-up, respondents are randomly
assigned to evaluate a fixed number of concepts and they see one concept at a time. The order in
which the concepts are presented is randomized. Respondents are asked to evaluate the concept
using a scale to rate their “interest” or “likelihood of purchase.” In the reporting stage, the key
reporting statistic used to assess the preference for concepts is determined. This could be, for
example, the top 2 box ratings on a 5-point Likert scale of purchase interest. This reporting
statistic, once determined, should remain consistent across waves.
In this type of sequential monadic approach, each concept is tested independently. Over time,
the key reporting statistics are compiled for all of the concepts tested. This allows for the
establishment of norms and action standards—the determination of “how good is good?” Such
standards are essential for identifying the best ideas for promotion to the next stage of product
development. However, it is at this stage that difficulties often appear. A commonly encountered
problem is that rating scale statistics offer only very small differentiation between concepts. If the
difference between the second quartile and the top decile is 6%—while the margin of error is
around 8%—how can we really tell if a concept is merely good, or if it is excellent? Traditional
rating scales do not always allow us to clearly identify which concepts are truly preferred by
potential customers. Another method is needed to establish respondent preference. MaxDiff
seems to be an obvious choice.
2. MAXDIFF
MaxDiff methodology has shown itself to be superior to ratings (Cohen 2003; Chrzan &
Golovashkina 2006). Instead of using a scale, respondents are shown a subset of concepts and
asked to choose the one they like best and the one they like least. The task is then repeated
several times according to a statistical design. There is no scale-use bias. The tradeoffs
respondents make during the MaxDiff exercise not only reveal their preference heterogeneity, but
also provide much better differentiation between the concepts.
However, MaxDiff results typically reveal only relative preferences—how well a concept
performs relative to other concepts presented at the same time. The best-rated concept from one
wave of testing might not, in fact, be any better than a concept that was rated as mediocre in a
different wave of testing. How can we assure the marketing and product development teams that
the best concept is truly a great concept rather than simply “the tallest pygmy”?
Anchoring
Converting the relative preferences from MaxDiff studies into absolute preference requires
some form of anchoring. Several attempts have been made at achieving this. Orme (2009) tested
and reported the results from dual-response anchoring proposed by Jordan Louviere. After each
MaxDiff question, an additional question was posed to the respondent:
Considering just these four features . . .
• All four are important
• None of these four are important
• Some are important, some are not
Because this additional question was posed after each MaxDiff task, it added significantly
more time to the MaxDiff questionnaire.
Lattery (2010) introduced the direct binary response (DBR) method. After all the MaxDiff
tasks have been completed, respondents were presented with all the items in one list and asked to
check all the items that appealed to them. This proved to be an effective way of revealing
absolute preference—it was simpler and only required one additional question. Horne et al.
(2012) confirmed these advantages of the direct binary method, but also demonstrated that it was
subject to context effects; specifically, the number of items effect. If the number of items is
different in two tests, for example, one has 20 items and another has 30 items, can we still
compare the results?
Making It Adaptive
Putting aside the problem of anchoring for now, consider the MaxDiff experiment itself.
MaxDiff exercises are repetitive. All the tasks follow the same layout, for example, picking the
best and the worst amongst 4 options. The same task is repeated over and over again. The
exercise can take a long time when a large number of items need to be tested. For example, if
there are 20 items to test, it can take 12 to 15 tasks per respondent to collect enough information
for modeling.
As each item gets the same number of exposures, the “bad” items get just as much attention
as the “good” items. Respondents are presented with subsets of items that seem random to them.
They cannot see where the study is going (especially if they see something they have already
rejected popping up again and again) and can become disengaged.
The adaptive approach, first proposed by Bryan Orme in 2006, is one way to tackle this
issue. Here is an example of a 20-item Adaptive MaxDiff exercise:
The process works much like an athletic tournament. In stage 1, we start with 4 MaxDiff
tasks with 5 items per task. The losers in stage 1 are discarded, leaving 16 items in stage 2. The
losers are dropped out after each stage. By stage 4, 8 items are left, which are evaluated in 4
pairs. In the final stage, we ask the respondents to rank the 4 surviving items. Respondents can
see where the exercise is going. The tasks are different, and so less tedious; and respondents find
the experience more enjoyable overall (Orme 2006).
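A minimal sketch of that tournament flow, assuming a respondent’s picks can be simulated from known preference scores, is given below; the stage sizes follow the 20-item example and the function and variable names are ours, not the authors’ implementation.

    import random

    def adaptive_maxdiff(items, prefs, tasks_per_stage=4):
        """Simulate the 20-item adaptive tournament: group the surviving items
        into tasks, drop each task's 'worst' pick, and finish by ranking the
        last four items.  prefs[item] stands in for the respondent's preference."""
        survivors = list(items)
        random.shuffle(survivors)
        while len(survivors) > 4:                          # stages 1-4
            task_size = len(survivors) // tasks_per_stage
            next_round = []
            for t in range(tasks_per_stage):
                task = survivors[t * task_size:(t + 1) * task_size]
                worst = min(task, key=lambda i: prefs[i])  # the task's loser is discarded
                next_round.extend(i for i in task if i != worst)
            survivors = next_round
        return sorted(survivors, key=lambda i: prefs[i], reverse=True)  # stage 5: rank

    items = list(range(20))
    prefs = {i: random.random() for i in items}
    print(adaptive_maxdiff(items, prefs))   # four finalists, best first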
At the end of the Adaptive MaxDiff exercise, the analyst has individual-level responses showing
which items are, and which are not, preferred. Because the better items are retained into the later
stages and therefore have more exposures, the better items are measured more precisely than the
less preferred items. This makes sense—often we want to focus on the winners. Finally, the
results from Adaptive MaxDiff are consistent with the results of traditional MaxDiff (Orme
2006).
Anchored Adaptive MaxDiff
Anchored Adaptive MaxDiff combines the adaptive MaxDiff process with the direct binary
anchoring approach. However, since the direct binary approach is sensitive to the number of
items included in the anchoring question, we want to fix that number. For example, regardless of
the number of concepts tested, the anchoring question may always display 6 items. Since this
number is fixed, all waves must have at least that number of concepts for testing; most waves will
probably have more. This leads to the question: which items should be included in the
anchoring question? To provide an anchor that is consistent between waves, we want to include
items that span the entire spectrum from most preferred to least preferred. The Adaptive MaxDiff
process gives us that. In the 20 item example shown previously, these items could be used in the
anchoring:
Ranked Best from Stage 5
Ranked Worst from Stage 5
Randomly select one of the discards from stage 4
Randomly select one of the discards from stage 3
Randomly select one of the discards from stage 2
Randomly select one of the discards from stage 1
None of the above
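For instance, if the stage at which each item was eliminated is recorded during the exercise, the six-item anchoring list could be assembled along these lines (a sketch under our own naming, not the authors’ implementation):

    import random

    def anchoring_items(eliminated_at, ranked_final):
        """Build the six-item anchoring list from a 20-item adaptive run.

        eliminated_at : dict mapping item -> stage (1-4) at which it was dropped
        ranked_final  : final-stage ranking, best first (4 items)
        """
        picks = [ranked_final[0], ranked_final[-1]]   # best and worst of stage 5
        for stage in (4, 3, 2, 1):                    # one random discard per stage
            discards = [i for i, s in eliminated_at.items() if s == stage]
            picks.append(random.choice(discards))
        return picks                                  # shown with a "None of the above" option

    # Hypothetical example: items 0-19, four finalists ranked 7, 2, 11, 5.
    eliminated_at = {i: 1 + (i % 4) for i in range(20) if i not in (7, 2, 11, 5)}
    print(anchoring_items(eliminated_at, [7, 2, 11, 5]))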
The Anchored Adaptive MaxDiff approach has the following benefits:
• A more enjoyable respondent experience.
• More precise estimates for the more preferred items—which are the most important estimates.
• No “number of items” effect through the “controlled” anchoring question.
• Anchored (absolute) preference.
The question remains whether the results from multiple Anchored Adaptive MaxDiff
experiments are truly comparable to each other.
3. OUR EXPERIMENT
While the actual concepts included in continuous testing programs vary, the format and the
number of concepts tested are relatively stable. This allows us to set up the Adaptive MaxDiff
exercise and the corresponding binary response anchoring question and use a nearly identical
structure for each wave of testing.
Ideally, the anchored MaxDiff methodology would be tested by comparing its results to an
independent dataset of results obtained from scale ratings. That becomes very costly, especially
given the multiple wave nature of the process we are interested in. To save money, we collected
both sets of data at the same time. An Anchored Adaptive exercise was piggybacked onto the
usual sequential monadic concept testing which contained the key purchase intent question.
We tested a total of 126 concepts in 5 waves, approximately once every 6 months. The
number of concepts tested in each wave ranged from 20 to 30. Respondents completed the
Anchored Adaptive MaxDiff exercise first. They were then randomly shown 3 concepts for
rating in a sequential monadic fashion. Each concept was rated on purchase intent, uniqueness,
etc. Respondents were also asked to provide qualitative feedback on that concept.
The overall sample size was set to ensure that each concept received 150 exposures.
Wave    Number of concepts   Field Dates   Sample Size
1       25                   Spring 2011   1,200
2       20                   Fall 2011     1,000
3       21                   Spring 2012   1,000
4       30                   Fall 2012     1,500    (including 5 known “star” concepts)
5       30                   Spring 2013   1,500    (with 2 “star” concepts repeated)
Total   126                                6,200
In wave 4, we included five known “star” concepts. These were concepts based on existing
products with good sales history—they should test well. Two of the star concepts were then
repeated in wave 5. Again, they should test well. More importantly, they should receive
consistent results in both waves.
The flow of the survey is shown in the following diagram:
The concept tests were always done sequentially, with MaxDiff first and the sequential
monadic concept testing questions afterwards. Thus, the concept test results were never “pure
and uncontaminated.” It is possible the MaxDiff exercise might have influenced the concept test
results. However, as MaxDiff exposed all the concepts to all the respondents, we introduced no
systematic biases for any given concept tested.
The results of this experiment are shown here:
The numbers on each line identify the concepts tested in that wave. Number 20 in wave 1
is different from number 20 in wave 2—they are different concepts with the same id number. The
T2B lines are the top-2-box Purchase Intent, i.e., the proportion of respondents who rated the
concept as “Extremely likely” or “Somewhat likely” to purchase, based on approximately
n=150 ratings per concept.
The AA-MD lines are the Anchored Adaptive MaxDiff (AA-MD) results. We used Sawtooth
Software CBC/HB in the MaxDiff estimation and the pairwise coding for the anchoring question
as outlined in Horne et al. (2012). We plot the simulated probability of purchase, i.e., it is the
exponentiated beta for a concept divided by the sum of exponentiated beta of that concept and
exponentiated beta of the anchor threshold. The numbers are the average across respondents. It
can be interpreted as the average likelihood of purchase for each concept.
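In other words, for each respondent and concept the reported score is exp(beta_concept) / (exp(beta_concept) + exp(beta_anchor)), averaged across respondents. A small sketch of that calculation, with hypothetical inputs and our own names:

    import numpy as np

    def anchored_purchase_score(beta_concept, beta_anchor):
        """Average simulated probability of purchase for one concept.
        beta_concept, beta_anchor: per-respondent HB utilities (1-D arrays)."""
        p = np.exp(beta_concept) / (np.exp(beta_concept) + np.exp(beta_anchor))
        return p.mean()

    # Hypothetical example with 1,000 simulated respondents.
    rng = np.random.default_rng(7)
    print(anchored_purchase_score(rng.normal(0.2, 1.0, 1000), rng.normal(0.0, 1.0, 1000)))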
While the Anchored Adaptive MaxDiff results are generally slightly lower than the T2B
ratings, there are many similarities. The AA-MD results have better differentiation than the
purchase intent ratings. Visually, there is less bunching together, especially among the better
concepts tested.
Note that while we used T2B score here, we also did a comparison using the average
weighted purchase intent. With a 5-point purchase intent scale, we used weighting factors of
0.7/0.3/0.1/0/0. The results were virtually identical. The correlation between T2B score and
weighted purchase intent was 0.98; the weighted purchase intent numbers were generally lower.
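As a small illustration, the weighted purchase intent described above is simply a weighted share of the five scale points (weights as stated; the respondent counts below are invented):

    # Weights applied to the 5-point purchase-intent scale, top box first.
    weights = [0.7, 0.3, 0.1, 0.0, 0.0]

    # Hypothetical counts of respondents choosing each scale point for one concept.
    counts = [45, 38, 32, 20, 15]

    weighted_pi = sum(w * c for w, c in zip(weights, counts)) / sum(counts)
    print(round(weighted_pi, 3))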
Below is a scatter plot of AA-MD scores versus T2B ratings. There is excellent consistency
between the two sets of numbers, suggesting that they are essentially measuring the same
construct. Since T2B ratings can be used in aggregating results across waves, MaxDiff scores
can be used for that as well.
We should note the overlap of the concepts (in terms of preferences by either measure)
among the 5 waves. All the waves have some good/preferred concepts and some bad/less
preferred concepts. No wave stands out as being particularly good or bad. This is not surprising
in a continuous concept testing environment as the ideas are simply tested as they come along
without any prior manipulation. We can say that there is naturally occurring randomization in the
process that divides the concepts into waves. We would not expect to see one wave with only
good concepts, and another with just bad ones. This may be one of the reasons why AA-MD
works here. If the waves contain similarly worthy concepts, the importance of anchoring is
diminished.
To examine the role of the anchor, we reran the results without the anchoring question in the
MaxDiff estimation. Anchoring mainly helps us to interpret the MaxDiff results in an “absolute”
sense, so that Anchored Adaptive MaxDiff scores represent the average simulated probability of
purchase. Anchoring also marginally improves consistency with the T2B purchase
intent measure across waves.
Alpha with T2B (%) Purchase Intent (n=126 concepts)
AA-MD scores with Anchoring      0.94
AA-MD scores without Anchoring   0.91
We combined the results from all 126 concepts. The T2B line in the chart below is the top-2-box Purchase Intent and the other is the AA-MD scores. The table to the right shows the
distribution of the T2B ratings and the AA-MD scores.
These numbers are used to set “norms” or action standards—hundreds of concepts are tested;
only a few go forward. There are many similarities between the results. With T2B ratings, there
is only a six percentage point difference between something that is good and something truly
outstanding, i.e., 80th to 95th percentile. The spread is much wider (a 12 point difference) with
AA-MD scores. The adaptive nature of the MaxDiff exercise means that the worse performing
concepts are less differentiated, but that is of little concern.
The five “star” concepts all performed well in AA-MD, consistently showing up in the top
third. The results were mixed in T2B purchase intent ratings, with one star concept (#4) slipping
below the top one-third and one star concept (#5) falling into the bottom half.
When the same two star concepts (#3 and #4) were repeated in wave 5, the AA-MD results
were consistent between the two waves for both concepts, while the T2B ratings were more
varied. Star concept #4’s T2B rating jumped 9 points in the percent rank from wave 4 to wave 5.
While these results are not conclusive given that we only have 2 concepts, they are consistent
with the type of results we expect from these approaches.
4. DISCUSSION
As previously noted, concepts are usually tested as they come along, without prioritization.
This naturally occurring randomization process may be one of the reasons the Anchored
Adaptive MaxDiff methodology appears to be free from the context effect brought on by testing
the concepts at different times, i.e., waves.
Wirth & Wolfrath (2012) proposed Express MaxDiff to deal with large numbers of items.
Express MaxDiff employs a controlled block design and utilizes HB’s borrowing strength
mechanism to infer full individual parameter vectors. In an Express MaxDiff setting, respondents
are randomly assigned into questionnaire versions, each of which deals with only a subset of
items (allocated using an experimental design). Through simulation studies, the authors were
able to recover aggregate estimates almost perfectly and satisfactorily predicted choices in
holdout tasks. They also advised users to increase the number of prior degrees of freedom, which
significantly improved parameter recovery.
Another key finding from Wirth & Wolfrath (2012) is that the number of blocks used to
create the subset has little impact on parameter recovery. However, increasing the number of
items per block improves recovery of individual parameters. Taking this idea to the extreme, we
can create an experiment where each item is shown in one and only one of the blocks (as few
blocks as possible), and there are a fairly large number of items per block. This very much
resembles our current experiment with 126 concepts divided into 5 blocks, each with 20–30
concepts.
If we pretend our data came from an Express MD experiment (excluding data from anchoring),
we can create an HB run with all 126 items together using all 6,200 respondents. Using a fairly
strong prior (d.f. = 2,000) to allow for borrowing across samples, the results correlate almost
perfectly with what we obtained from each individual dataset. This again demonstrates the lack
of any context effect due to the “wave” design. This also explains why we see only a marginal
decline in consistency between AA-MD score and T2B ratings when anchoring is excluded from
the model.
5. “ABSOLUTE” COMPARISON & ANCHORING
While we are satisfied that Anchored Adaptive MaxDiff used in the continuous concept
testing environment is indeed free from this context effect, can this methodology be used in other
settings where such naturally occurring randomization does not apply? Several conference
attendees asked us if Anchored Adaptive MaxDiff would work if all the concepts tested were
known to be “good” concepts. While we have no definitive answer to this question and believe
further research is needed, Orme (2009b) offers some insights into this problem.
Orme (2009b) looked at MaxDiff anchoring using, among other things, a 5-point rating scale.
There were two waves of the experiment. In wave 1, 30 items were tested through a traditional
MaxDiff experiment. Once that was completed, the 30 items were analyzed and ordered in a list
from the most to the least preferred. The list was then divided in two halves with best 15 items in
one and the worst 15 items in the other. In wave 2 of the study, respondents were randomly
assigned into an Adaptive MaxDiff experiment using either the best 15 list or the worst 15 list.
The author also made use of the natural order of a respondent’s individual level preference
expressed through his Adaptive MaxDiff results. In particular, after the MaxDiff tasks, each
respondent was asked to rate, on a 5-point rating scale, the desirability of 5 items which
consisted of the following:
Item 1: Item winning Adaptive MaxDiff tournament
Item 2: Item not eliminated until 4th Adaptive MaxDiff round
Item 3: Item not eliminated until 3rd Adaptive MaxDiff round
Item 4: Item not eliminated until 2nd Adaptive MaxDiff round
Item 5: Item eliminated in 1st Adaptive MaxDiff round
Since those respondents with the worst 15 list saw only items less preferred, one would
expect the average ratings they gave to be lower than those given by respondents who saw the best 15
list. However, that was not what respondents did. Indeed, the mean ratings for the winning
item were essentially tied between the two groups.
                               Worst15 (N=115)   Best15 (N=96)
Winning MaxDiff Item           3.99              3.98
Item Eliminated in 1st Round   2.30              3.16
(Table 2—Orme 2009b)
This suggested that respondents were using the 5-pt rating scale in a “relative” manner,
adjusting their ratings within the context of items seen in the questionnaire. This made it difficult
to use the ratings data as an absolute measuring stick to calibrate the Wave 2 Adaptive MaxDiff
scores and recover the pattern of scores seen from Wave 1. A further “shift” factor (quantity A
below) is needed to align the best 15 and the worst 15 items.
(Figure 2—Orme 2009b)
In another cell of the experiment, respondents were asked to volunteer (open-ended question)
items that they would consider the best and worst in the context of the decision. Respondents
were then asked to rate the same 5 items selected from their Adaptive MaxDiff exercise along
with these two additional items on the 5-point desirability scale.
                               Worst15 (N=96)   Best15 (N=86)
Winning MaxDiff Item           3.64             3.94
Item Eliminated in 1st Round   2.13             2.95
(Table 3—Orme 2009b)
(Figure 3—Orme 2009b)
Interestingly, this time the anchor rating questions (for the 5-items based on preferences
expressed in the Adaptive MaxDiff) yielded more differentiation between those who received the
best 15 list and those with the worst 15 list. It seems that asking respondents to rate their own
absolute best/worst provided a good frame of reference so that the 5-pt rating scale could be used in
a more “absolute” sense. This effect was carried through into the modeling. While there was still
a shift in the MaxDiff scores from the two lists, the effects were much less pronounced.
We are encouraged by this result. It makes sense that the success of anchoring is directly tied
to how the respondents use the anchor mechanism. If respondents were using the anchoring
mechanism in a more “absolute” sense, the anchored MaxDiff score would be more suitable as
an “absolute” measure of preferences, and vice versa.
Coincidentally, McCullough (2013) contained some data that showed that respondents do not
react strictly in a “relative” fashion to direct binary anchoring. The author asked respondents to
select all the brand image items that would describe a brand after a MaxDiff exercise about those
brand image items and the brands. Respondents selected many more items for the two existing
(and well-known) brands than for the new entry brand.
Average Number of Brand Image Items Selected
Brand #1    4.9
Brand #2    4.2
New Brand   2.8
Why are respondents selecting more items for the existing brands? Two issues could be at
work here:
1. There is a brand-halo effect. Respondents identify more items for the existing brands
simply because the brands themselves are well known.
2. A well-known brand would be known for more brand image items, and respondents are
indeed making “absolute” judgment when it comes to what is related to that brand and
are making more selections because of it.
McCullough (2013) also created a Negative DBR cell where respondents were asked the
direct binary response question for all the items that would not describe the brands. The author
found that the addition of this negative information helped to minimize the scale-usage bias and
removed the brand-halo effect. We see clearly now that existing brands are indeed associated
with more brand image items.
(McCullough 2013)
While we cannot estimate the exact size of the effects due to brand-halo, we can conclude
that the differences we observed in the number of items selected in the direct binary response
question are at least partly due to more associations with existing brands. That is, respondents are
making (at least partly) an “absolute” judgment in terms of what brand image items are
associated with each of the brands. We are further encouraged by this result.
To test our hypothesis that respondents make use of direct binary response questions in an
“absolute” sense, we set out to collect additional data. We asked our colleagues from Angus Reid
Global (a firm specializing in public opinion research) to come up with two lists: list A included 6
items that were known to be important to Canadians today (fall of 2013) and list B included 6
items that were known to be less important.
List A                     List B
Economy                    Aboriginal Affairs
Ethics / Accountability    Arctic Sovereignty
Health Care                Foreign Aid
Unemployment               Immigration
Environment                National Unity
Tax relief                 Promoting Canadian Tourism Abroad
We then showed a random sample of Angus Reid Forum panelists one of these lists
(randomly assigned) and asked them to select the issues they felt were important. Respondents
who saw the list with known important items clearly selected more items than those respondents
who saw the list with known less important items.
Number of Items Identified as Important Out of 6 Items
                                  n     Mean   Standard Deviation
List A (Most Important Items)     505   3.0    1.7
List B (Less Important Items)     507   1.8    1.4
p-value on the difference in means < 0.0001
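The reported p-value can be checked from the summary statistics alone; the sketch below computes Welch’s t statistic from the means, standard deviations and sample sizes shown above.

    import math

    def welch_t(m1, s1, n1, m2, s2, n2):
        """Two-sample t statistic and Welch degrees of freedom from summary stats."""
        v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2
        t = (m1 - m2) / math.sqrt(v1 + v2)
        df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
        return t, df

    t, df = welch_t(3.0, 1.7, 505, 1.8, 1.4, 507)
    print(round(t, 1), round(df))   # roughly t = 12.3 on about 970 df, so p < 0.0001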
This result is encouraging because it indicates that respondents are using the direct binary
anchoring mechanism in an “absolute” sense. This bodes well for using Anchored Adaptive
MaxDiff results for making “absolute” comparisons.
6. CONCLUSIONS
Many companies perform continuous concept tests using sequential monadic ratings. MaxDiff
can be a superior approach, as it provides better differentiation between the concepts tested.
Using Adaptive MaxDiff mitigates the downside of MaxDiff from the respondents’ perspective,
improving their engagement. The problem with using MaxDiff in a continuous testing
environment is that MaxDiff results are relative, not absolute. Therefore, some form of anchoring
is needed to compare results between testing waves. Our results demonstrate that using Adaptive
MaxDiff with a direct-binary anchoring technique is a feasible solution to this problem. A
promising field for further research is to examine waves that differ significantly in the quality of
the concepts they are testing. However, we have some evidence that respondents use the direct
binary anchoring in a suitably “absolute” sense. Given the importance of separating the wheat
from the chaff at a relatively early (and inexpensive) stage of the product development process,
we believe that this approach is an important step forward for market research.
Rosanna Mau
Jane Tang
REFERENCES
Chrzan, K. & Golovashkina, N. (2006), “An Empirical Test of Six Stated Importance Measures”
International Journal of Market Research Vol. 48 Issue 6
Horne, J., Rayer, B., Baker, R., & Lenart, S. (2012) “Continued Investigation into the Role of the
‘Anchor’ In Maxdiff and Related Tradeoff Exercises” Sawtooth Software Conference
Proceedings
Lattery, K. (2011) “Anchoring Maximum Difference Scaling Against a Threshold—Dual
Response and Direct Binary Responses” Sawtooth Software Technical Paper Library
McCullough, P. R. (2013) “Brand Imagery Measurement: A New Approach” Sawtooth Software
Conference Proceedings
Orme, B. (2006), “Adaptive Maximum Difference Scaling” Sawtooth Software Technical Paper
Library
Orme, B. (2009), “Anchored Scaling in MaxDiff Using Dual Response” Sawtooth Software
Technical Paper Library
Orme, B. (2009b), “Using Calibration Questions to Obtain Absolute Scaling in MaxDiff”
SKIM/Sawtooth Software Conference
Wirth, R. & Wolfrath, A. (2012) “Using MaxDiff for Evaluating Very Large Sets of Items:
Introduction and Simulation-Based Analysis of a New Approach” Sawtooth Software
Conference Proceedings
HOW IMPORTANT ARE THE OBVIOUS COMPARISONS IN CBC?
THE IMPACT OF REMOVING EASY CONJOINT TASKS
PAUL JOHNSON
WESTON HADLOCK
SSI
BACKGROUND
Predicting consumer behavior with relatively high accuracy is a fundamental necessity for
market researchers today. Multi-million dollar decisions are frequently based on the data
collected and analyzed across a vast number of differing methodologies. One of the more
prominent methods used today involves the study of choice via conjoint exercises (Orme, 2010).
Conjoint analysis asks respondents to make trade-off decisions between different levels of
attributes. For example, a respondent could be asked if they prefer to see a movie the evening of
its release for $15 or wait a week later to see it for $10. However, many computer-generated
conjoint designs will include some comparisons where you can get the best of both worlds, such
as comparing opening night for $10 to a week later for $15. These easier comparisons contain a
dominated concept where our prior expectations tell us that nobody is going to wait a week in
order to pay $5 more.
Choice tasks with dominated concepts are easy for respondents, but we learn little from them.
The dominated concept is rarely selected because respondents do not need to sacrifice anything
by avoiding it. Theoretically, showing these dominated concepts takes respondent time, without
providing much additional benefit because we believe they will not select the dominated concept.
Ideally each of the concepts presented should appeal to the average respondent about equally
(similar utility scores), but just in different ways. Utility balancing will allow us to distinguish
between the preferences of each individual rather than seeing everyone avoiding the obviously
inferior concepts.
Some efforts have been made to incorporate utility balancing in randomized conjoint designs.
One option is to specify prior part-worth utilities for the products shown and define allowable
ranges for utilities in a single task. This approach is rarely used because in most cases the exact
magnitude of the expected utility of each attribute level is not known, so the a priori utilities of
the products cannot be calculated. Prohibitions are another way to limit dominated concepts.
Keith Chrzan and Bryan Orme saw theoretical efficiency gains when using corner prohibitions
with simulated data, but prohibitions in general can be dangerous (Orme & Chrzan, 2000).
Conditional and summed pricing are more effective ways to adjust price to avoid these
dominated concepts by forcing an implicit tradeoff with price (Orme, 2007). While they are
effective at reducing the number of dominated concepts found in a design, none of these
techniques are 100% effective at making sure that they do not appear anywhere in the design.
With these things in mind, we examine an alternative approach to balancing product
utility in a conjoint task. While the exact magnitudes of the a priori part-worth utilities are not
normally known, the a priori order of the part-worth utilities within each attribute is commonly
known. We identify tasks containing dominated concepts (referred to as easy tasks) by
comparing the attribute levels of products inside each task; if any product is equal to or superior
to another product on all attributes with an a priori order then it is considered an easy task. These
easy tasks are then replaced with a new conjoint task.
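A sketch of that check is shown below, assuming each concept is coded as its level position on every a priori ordered attribute (higher position meaning an a priori better level); the encoding and names are ours.

    def is_dominated(a, b):
        """True if concept b is equal to or better than concept a on every
        a priori ordered attribute (the rule stated in the text)."""
        return all(y >= x for x, y in zip(a, b))

    def is_easy_task(task):
        """A task is 'easy' if it contains a concept dominated by another concept."""
        return any(is_dominated(a, b)
                   for i, a in enumerate(task)
                   for j, b in enumerate(task) if i != j)

    # Hypothetical task: three concepts coded as level ranks on three ordered attributes.
    task = [(3, 2, 5), (1, 2, 4), (3, 2, 4)]
    print(is_easy_task(task))   # True: the third concept is dominated by the first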
Theoretically, avoiding these dominated concepts in a standard randomized design would
achieve similar hit-rate percentages with fewer tasks. However, we theorize based on our own
experience that both designs will have similar hit rates. We also hypothesize that eliminating the
easy tasks will increase the difficulty of the task and respondents will take more time to complete
each task. Due to this increased difficulty, we thought the respondent experience would be
negatively impacted.
METHODS
To maintain high incidence and control costs we sampled 500 respondents from SSI’s online
panels after targeting for frequent movie attendance (at least once a month). The conjoint probed
the preferences for different attributes in the movie theater experience ranging from the seating
to the concessions included in a bundling option. We balanced on gender and age for general
representativeness of the United States. Once respondents qualified for the study they were
randomly shown one of two conjoint studies: balanced overlap design (control) or the same
design with all easy tasks removed (treatment). We used the balanced overlap design as our
control because it is the default in Sawtooth Software and is widely used in the industry. The
attributes and levels tested in the design are shown below in the a priori preference from worst to
best. Note that these prior expectations may not prove correct. For example we assumed that
having one candy, one soda per person and a group popcorn bucket was better than just having
one candy and one soda per person. However it could be that some people do not like popcorn
and would actually prefer going without that additional item. When making these prior
expectation assumptions it is important to keep in mind that preferences can be very
heterogeneous and what might be a dominated concept for one person might not be a dominated
concept for another.
Table 1. Design Space
Attribute           Levels
New Release         Opening Day/Night; Opening Week/Weekend; After Opening Week
Food Included       1 Candy, 1 Soda per person, Group popcorn bucket; 1 Candy, 1 Soda per person; Group popcorn bucket; 1 Candy per person; No Package Provided
Seating             Choose Your Own/Priority Seating; General Admission Seating
Minimum Purchase    None; 3 tickets; 6 tickets
Movie Type          3D; Standard 2D
Drive Time          Fewer than 5 minutes; 5–10 minutes; 10–20 minutes; 20–30 minutes; Over 30 minutes
Price               $8.00; $8.50; $9.00; $9.50; $10.00; $10.50; $11.00; $11.50; $12.00; $12.50; $13.00; $13.50; $14.00
The treatment design used the balanced overlap design as a base. It had 300 versions with 8
tasks in each totaling 2,400 tasks. Each task showed four possible movie packages and a None
option in a fifth column. After exporting the design into Microsoft Excel, we wrote a formula
that searched for tasks with dominated concepts. It identified 562 easy tasks (23.4% of the total
tasks) within the original design. We removed these tasks from the design and renumbered the
versions and tasks to keep it consistent with eight tasks shown in each version. For example, if
version 1 task 8 was an easy task we changed version 2 task 1 to the new version 1 task 8 and
version 2 task 2 became the new version 2 task 1. After following this process for the entire
design matrix, we were left with 229 complete versions for the treatment design. We ran the
diagnostics on both designs and they both had high efficiencies (the lowest was .977), so we felt
comfortable that both designs were robust enough to handle the analysis.
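The removal-and-renumbering step can be expressed as a simple repacking of the surviving tasks into complete eight-task versions, as in the sketch below (which reuses the is_easy_task check sketched earlier; the design is assumed to be a list of (version, task, concepts) records).

    def repack_design(design, tasks_per_version=8):
        """Drop easy tasks from an exported design and regroup the surviving
        tasks into complete versions, discarding any leftover partial version.

        design: list of (version, task, concepts) records, where concepts is the
        list of level-rank tuples checked by is_easy_task (sketched earlier)."""
        keep = [concepts for (_version, _task, concepts) in design
                if not is_easy_task(concepts)]
        n_versions = len(keep) // tasks_per_version
        return [keep[v * tasks_per_version:(v + 1) * tasks_per_version]
                for v in range(n_versions)]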
After the random tasks were shown to each respondent, we showed three in-sample holdout
tasks. These holdout tasks mimicked the format of the random tasks with four movie options and
a traditional none option, but they represented realistic purchasing scenarios that could be seen in
the market without any dominated concepts. We used these tasks to measure hit rates under
differing conditions. No out-of-sample holdout tasks were tested. We ran incremental HB utilities
on each data set starting with one task and going to all eight tasks being included in the utility
calculations. These utilities test how well each design performs using fewer tasks to predict the
holdout tasks. We also examined how variable the average part-worth utilities were in each
design as more tasks were added. Theoretically, the treatment design should have higher hit rates
and more stable utilities with fewer tasks because it doesn’t waste time collecting information on
the easy tasks.
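The hit rate itself is just the share of holdout tasks in which the concept with the highest total utility matches the concept actually chosen; a sketch of that calculation with assumed array shapes and names:

    import numpy as np

    def hit_rate(utilities, holdout_designs, holdout_choices):
        """Share of holdout tasks whose highest-utility concept was the one chosen.

        utilities       : (n_respondents, n_parameters) HB part-worths
        holdout_designs : (n_respondents, n_tasks, n_concepts, n_parameters) coded designs
        holdout_choices : (n_respondents, n_tasks) index of the chosen concept
        """
        # Total utility of every concept in every holdout task for every respondent.
        totals = np.einsum('rp,rtcp->rtc', utilities, holdout_designs)
        predicted = totals.argmax(axis=2)
        return (predicted == holdout_choices).mean()

    # Hypothetical example: 50 respondents, 3 holdout tasks, 4 concepts, 6 parameters.
    rng = np.random.default_rng(3)
    U = rng.normal(size=(50, 6))
    X = rng.normal(size=(50, 3, 4, 6))
    y = rng.integers(0, 4, size=(50, 3))
    print(hit_rate(U, X, y))   # about 0.25 by chance with 4 concepts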
At the end of the survey, respondents were asked two questions on a 5-point Likert scale
about their survey experience: “How satisfied were you with your survey experience?” and
“How interesting was the topic of the survey to you?” These are the standard questions SSI uses
to monitor respondents’ reactions to surveys they complete. Lastly the time taken on each of the
random tasks was recorded to see if the difficult tasks required more time.
RESULTS
Design Performance with Fewer Tasks
The hit rate percentages were moderately higher in both designs as more tasks were included.
The holdouts where none was selected were not excluded and were only counted as a hit if the
model also predicted that they would select none. Because of the nature of the holdout tasks we
would expect a hit rate somewhere between 20% (5 options with the none) and 25% (4 options
without the none) just by chance. In each design the hit rate percentage does increase, showing
that with more information in general we see better predictive performance. In one holdout task,
we see a significant lift in the predictive performance of the treatment design (Figure 1).
However, once we included at least five tasks in the utility calculations there is no difference
between the two designs. Also, even with very few tasks we do not see a significant lift in the
predictive performance of the treatment design on the second holdout task (Figure 2). The third holdout
fell between the first and the second, with an incremental but not statistically significant
lift in the holdout prediction with fewer than five tasks (Figure 3). While there is an indication that
the treatment design can produce a slight gain in predictive ability it isn’t consistent in other
holdout tasks. When we examined the average utilities, we looked to see if the average utilities
were about the same with few tasks as they were with all 8 tasks. We found that for some
attributes the control design estimated better with fewer tasks (meaning that the average utilities
with only 4 tasks were closer to the average utilities with all 8 tasks), but for other attributes the
treatment design estimated better. There was no significant improvement in the average utilities
by the treatment design.
Figure 1. Holdout 1 Hit Rate Percentages by Design and Number of Tasks
Figure 2. Holdout 2 Hit Rate Percentages by Design and Number of Tasks
Figure 3. Holdout 3 Hit Rate Percentages by Design and Number of Tasks
Design Completion Times
Overall both designs took comparable amounts of time to complete the task (Figure 4). The
first tasks average around 45 seconds and by the time the respondents get the hang of the
exercise they are taking tasks four to eight in a little under 20 seconds each. However, the pattern
in the control group is odd. It doesn’t follow the normal smooth trend of a learning curve. This
can be explained by separating out the difficult and the easy tasks in the control design. The easy
tasks in the control design take significantly less time when seen in the first through the fourth
task (Figure 5). The combination of these two types of tasks could be what is producing the time
spike on task three for the control group.
Figure 4. Completion Time by Task and Design Type
Figure 5. Completion Time by Task and Design Type
Design Satisfaction Scores
Respondents seemed not to mind the increased difficulty of the treatment design. In fact there
are small indications that they enjoyed the more difficult survey. The top-two-box scores on both the
satisfaction and interest questions were 8% higher in the treatment design (Figure 6).
However, when a significance test is done on the mean, the resulting t-test does not show a
statistically significant increase in the overall mean (Table 2). Another possible explanation is
that the utility balanced group had a slightly smaller percentage of respondents rating their
survey experience (73% versus 78%) which could have either raised or lowered their satisfaction
scores. In the end there might be some indications of better survey experience, but not enough to
conclude that the treatment design produced a better experience.
Figure 6. Satisfaction Ratings by Design Type
Table 2. T-test of Mean Satisfaction Ratings
t-Test: Two-Sample Assuming Unequal Variances

                     How satisfied were you with     How interesting was the topic
                     your survey experience?         of the survey to you?
                     Control   Utility Balanced      Control   Utility Balanced
Mean                 4.07      4.11                  4.02      4.13
Variance             0.67      0.61                  0.71      0.66
df                   377                             377
t Stat               -0.53                           -1.24
P(T<=t) one-tail     0.30                            0.11
DISCUSSION AND CONCLUSIONS
In general the treatment design which removed dominated concepts performed about the
same as the standard balanced overlap design. While there was a slight lift in the predicted
holdouts when you have sparse information for the utilities (less than five tasks for this specific
design space), both designs had similar predictive capabilities. The end results of the utilities
were comparable at the aggregate level even with sparse information. The easy tasks clearly took
less time to complete, but the difference went away as the respondents became accustomed to the
task in general. Lastly, there seems to be slight evidence that removing dominated concepts can
increase the interest and satisfaction in the survey, but not enough to statistically move the mean
rating.
There are other reasons for removing these dominated tasks. The most common is when a
client encounters one and rightly questions the reason for making these comparisons which seem
trivial and obvious to them. This feedback suggests another possibility of just not showing
dominated concepts. Hiding the dominated concepts within standard choice tasks would
essentially auto-code the dominated concept as inferior to whatever concept the
respondent selected. Future research needs to be done on the effects of this type of automated
design adjustment.
In this instance the a priori assumptions were largely correct. Dominated concepts were
chosen on average 6% of the time they were shown (one fourth of the rate you would expect to
see by chance). They were still selected sometimes which can indicate people who have more
noise in their utilities or people who might even buck the trend and actually prefer inferior levels.
For example, there could be a significant number of people who would prefer a 2D movie to a
3D movie. Imposing these assumptions and removing dominated concepts carries a risk should the
assumptions prove incorrect.
Paul Johnson
Weston Hadlock
REFERENCES
Burdein, I. (2013). Shorter isn’t always better. CASRO Online Conference. San Francisco, CA.
Orme, B. (2007). Three ways to treat overall price in conjoint analysis. Sawtooth Software
Research Paper Series, Retrieved from
http://www.sawtoothsoftware.com/download/techpap/price3ways.pdf
Orme, B. (2010). Getting started with conjoint analysis: Strategies for product design and pricing
research. (2nd ed.). Madison, WI: Research Publishers LLC.
Orme, B., & Chrzan, K. (2000). An overview and comparison of design strategies for choice-based conjoint analysis. Sawtooth Software Research Paper Series, Retrieved from
http://www.sawtoothsoftware.com/download/techpap/desgncbc.pdf
SEGMENTING CHOICE AND NON-CHOICE DATA
SIMULTANEOUSLY
THOMAS C. EAGLE
EAGLE ANALYTICS OF CALIFORNIA
Segmenting choice and non-choice data simultaneously refers to segmentation that combines
the multiple records inherent in a choice model estimation data set (i.e., the choices across
multiple tasks) with the single observation related to the non-choice data (e.g., behaviors,
attitudes, ratings, etc.) and derives segments using both data simultaneously. This is different
from conducting a sequential approach to segmenting these data (i.e., fitting a set of individual-level utilities first and then combining those with the non-choice data). It is also different from
repeating the non-choice data across the choice tasks data. The method can account for the
repeated nature of the choice data and the single observation of the non-choice data within a
latent class modeling framework. Latent Gold’s advanced syntax module enables one to conduct
such an analysis.
The original intent of this paper was to show how a researcher could conduct a segmentation
using both choice and non-choice data simultaneously without using a sequential approach: that
is, fitting a Hierarchical Bayes MNL model first, combining the resulting individual level
attribute utilities with the non-choice data, and subjecting that to a segmentation method such as
K-Means or Latent Class modeling. Over time, the objective evolved into demonstrating why
segmenting derived HB MNL utilities can be problematic.
If the objective of the research dictates the use of a method designed specifically for that
objective, then clearly one should use that method. However, in some cases the method has to be
adapted. The situation of segmenting choice and non-choice data has been problematic because it
there has not been a simple way to perform this simultaneously. Most practitioners, myself
included, have resorted to fitting HB MNL models first, adding the non-choice data to the
resulting utilities, and then conducting the segmentation analyses, simply because that seemed to
be the only way to go. In addition to discussing why that approach is suspect, this paper
demonstrates how to perform the segmentation properly within a latent class framework.
The paper uses a simple contrived simulation data set to show why sequentially segmenting
HB utilities with non-choice data has issues. We also show how to perform the analyses within a
latent class framework without resorting to fitting HB utilities first. A real-world example is also
briefly discussed in this paper.
WHY WORRY ABOUT USING HB DERIVED CHOICE UTILITIES IN SEGMENTATION?
Individual-level utilities derived from an HB MNL model are continuous, random variables
with values determined by the priors and the data themselves. Using Sawtooth Software (and
most other standard software) to fit the HB MNL model, the priors assume a multivariate normal
variance-covariance matrix. The means of the individual-level utilities are determined by a
multivariate regression model (the “upper model”) that assumes this multivariate normality and,
in the simplest case, a single intercept to model the mean. Adding covariates to the upper model
enables the model to estimate different upper-level means given the covariate pattern. Default
settings for the degrees of freedom, the prior variance, the variance-covariance matrix, and the
upper-level means generally assume we do not know much about these things—they are
“uninformative” priors. All this leads to utilities that tend to regress to the population mean, are
continuously distributed across respondents, and can mask or blur genuine gaps in the
distribution of the true parameters that may underlie the data. Without knowledge to
appropriately set the priors we have the potential to see a smoothing of the utilities that can be
problematic for segmentation.
Another problem in the use of HB utilities in segmentation is that we are using mean utilities
as if they are fixed point estimates of the respondents’ utilities. The very nature of the HB model
is that the mean utilities are means of a “posterior distribution” reflecting our uncertainty about
the respondent’s true utilities. Using the point estimates as though they are fixed removes the
very uncertainty that Bayesian models so nicely capture and account for. Should we do that when
conducting segmentation?
Another issue is that, as Sawtooth Software recommends, and as this paper will show, one
should rescale the derived HB utilities before using the sequential approach. However, the degree
of rescaling can affect the segmentation results in terms of the derived number of optimal
segments. The reason for rescaling is that the utilities of one respondent cannot be directly
compared to another respondent because of scale differences (i.e., the amount of uncertainty in a
respondent’s choices). This paper will show not only that rescaling is required even with data
where the amount of uncertainty is exactly the same across all respondents, but also that the
degree of rescaling may affect the results.
A final issue is that segmenting derived HB utilities in a K-means-like clustering, hierarchical
clustering, or latent class clustering method which assumes no model structure to the
segmentation is not the same thing as conducting segmentation where an explicit model
structure, such as a choice model, is imposed. If one’s objective is segmenting a sample of
respondents on the basis of a choice model, why use a method that does not employ that model
structure to segment the respondents? Why resort to using a method that is not model-based
when such a model (a latent class choice model, in particular) needs fewer assumptions about the
distributions of the data than does the typical HB choice model used by practitioners?
If one’s objective is segmentation, the question is, why would one want to subject the data to
any of this if they do not have to do so?
LATENT CLASS SEGMENTATION OF BOTH CHOICE AND NON-CHOICE DATA
Latent Gold’s advanced syntax module has the capabilities of conducting the segmentation
analyses using choice and non-choice data. Another software package, MPLUS (available from
Statmodel.com), has these capabilities when one is able to “trick” the program to handle the
multinomial logit model (MPLUS does not have the built-in capability to fit the classic
McFadden MNL model). The key to conducting this simultaneous segmenting of choice and
non-choice data is in the general capabilities of Latent Gold, the syntax itself, and the structure of
the data file. Figure 1 depicts a snippet of the Latent Gold syntax.
Figure 1: A snippet of code from Latent Gold choice syntax
The dashed boxes identify the syntax that deals with the non-choice variables. The choice
modeling syntax is bolded. A more complete example with the appropriate data structure is given
in the Appendix.
SIMULATION EXAMPLES
To examine the impact of segmenting HB derived choice utilities we build a simple
simulation of a MaxDiff task. The MaxDiff task consists of 6 items shown in a balanced
incomplete block design (BIBD) of 12 tasks, with 3 items appearing in each task. Each item
is seen 6 times and each is seen with every other item equally often. We create 4 segments with the
known utilities shown in Table 1.
Table 1: Actual segment utilities used to generate individual utilities

Actual     Seg 1   Seg 2   Seg 3   Seg 4
Item 1        2      -4      -2       4
Item 2        4       2      -4      -2
Item 3       -4      -2       4       2
Item 4       -2       4       2      -4
Item 5        3       1      -3      -1
Item 6        0       0       0       0
Total       100     100     100     100

(Seg 3 is Seg 1 flipped; Seg 4 is Seg 2 flipped.)
The utilities for 100 respondents per segment were constructed from the above by generating
individual-level parameters from a univariate normal deviate with mean 0 for each item. The
standard deviation used for the normal deviate varied as described in the cases below. The raw
choices for each task for each respondent were generated using each respondent’s generated
utilities and adding a Gumbel error with a scale of 1.0 (i.e., the error associated with each total
utility in a task was -ln(-ln(1 - U)) * 1.0 for the best choice and +ln(-ln(1 - U)) * 1.0 for the
worst choice, where U is a uniform[0,1] draw and 1.0 is the Gumbel error scale). These data were generated
using SAS.
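As a minimal R sketch of this generation step (the authors used SAS; the object names are illustrative, the individual utilities are assumed to be the segment utilities plus the normal deviates, and the rule of taking the highest total utility as best and the lowest as worst is an assumption consistent with the error signs above):

set.seed(1)
# Gumbel draw matching the formula above: -ln(-ln(1 - U)) * scale
rgumbel <- function(n, scale = 1.0) -log(-log(1 - runif(n))) * scale

seg_util <- c(2, 4, -4, -2, 3, 0)                      # Segment 1 utilities from Table 1
ind_util <- seg_util + rnorm(6, mean = 0, sd = 0.33)   # one Highly Differentiated respondent

task_items <- c(1, 2, 3)                               # the 3 items shown in one BIBD task
u_best  <- ind_util[task_items] + rgumbel(3)           # total utility for the best judgment
u_worst <- ind_util[task_items] - rgumbel(3)           # +ln(-ln(1 - U)) enters with the opposite sign
best    <- task_items[which.max(u_best)]               # assumed rule: best = highest total utility
worst   <- task_items[which.min(u_worst)]              # assumed rule: worst = lowest total utility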
The non-choice data was generated using the simulation capabilities of MPLUS. The
following non-choice data were generated for 4 segments (Table 2).
Table 2: Non-Choice Distribution Across and Within Segments

Variable     Value              Seg 1   Seg 2   Seg 3   Seg 4
Size                               95     107     103      95
X1           0                    94%      6%     95%     96%
X1           1                     6%     94%      5%      4%
X2           0                    95%     93%      3%      3%
X2           1                     5%      7%     97%     97%
X3           0                    96%     53%      4%     49%
X3           1                     4%     47%     96%     51%
Attitude 1   Mean (7 point)       6.4     2.6     2.4     5.5
Attitude 2   Mean (7 point)       1.9     5.4     1.9     5.4
Attitude 3   Mean (10 point)      3.0     8.0     8.1     3.1
Several simulations (Cases) are discussed in the paper:
1) Highly Differentiated—Raw: A simulation where the actual choice data is highly
concentrated within a segment, but highly differentiated across segments, and where the
non-choice data is NOT included. The standard deviation used to generate the known
utilities was 0.33.
2) Highly Differentiated—Aligned: The same choice data as #1 above, but with the non-choice data mapped one-to-one to the choice data. That is, segment 1 data for the non-choice data is aligned to segment 1 of the choice data.
3) Highly Differentiated—Random: The same choice data as #1 above, but the non-choice
data is randomly assigned to the choice data. In other words, the choice data and non-choice data do not share any common underlying segment structure in this case.
4) Less Differentiated—Aligned: Data where the generated choice utilities are not as
concentrated within segments and less differentiated across segments as a result. The
standard deviation used to generate the known utilities was 1.414. The non-choice data is
aligned to the choice segments as in Case 2.
Figure 2 below depicts the actual individual-level choice utilities generated for the Highly
Differentiated simulations (they were the same for Cases 1–3 above). Figure 2 is for items 1 and
2 only, but plots for other pairs of items showed a similar pattern of highly differentiated
segments of respondents. Figure 3 (also below) depicts the individual-level utilities generated for
the Less Differentiated (Case 4 above) simulation. Again, other pairs of the items show similar
patterns in the individual-level utilities.
Figure 2: Generated Individual Utilities and Segment Centroids for Items 1 & 2—
Highly Differentiated
ESTIMATION
We used Sawtooth Software’s CBC HB program to estimate the individual-level utilities
using their recommended settings for fitting MaxDiff choice data. These include: a prior variance
of 2; prior degrees of freedom of 5; and a prior variance-covariance matrix of 2 on the diagonals
and 1 on the off-diagonal elements. We used 50,000 burn-in iterations and our posterior mean
utilities are derived from 10,000 iterations subsequent to the burn-in iterations.
Figure 3: Generated Individual Utilities and Segment Centroids for Items 1 & 2—
Less Differentiated
We used Latent Gold 4.5 to segment all simulation data sets. We used the Latent Gold syntax
within the choice model advanced module to simultaneously segment the choice and non-choice
data. When segmenting the derived HB utilities we used Latent Gold’s latent class clustering
routine. This is equivalent to using K-means when all variables are continuous (see Magidson &
Vermunt, 2002). We investigate three ways of estimating and rescaling the HB utilities: 1) raw
utilities as derived from using the Sawtooth Software recommended settings for fitting MaxDiff
data; 2) utilities derived by setting the prior degrees of freedom so high as to “pound” the utilities
into submission (“hammering the priors”); and 3) rescaling the raw (unhammered) utilities to
have a range from the maximum to minimum utility of 10 (Sawtooth Software recommends a
value of 100).
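As an illustrative stand-in for the latent class clustering step just described (the authors used Latent Gold; here a finite Gaussian mixture fit with R's mclust package plays the equivalent role, and hb_util is a hypothetical respondents-by-items matrix of posterior mean utilities):

library(mclust)

fit <- Mclust(hb_util, G = 2:12)   # candidate solutions with 2 to 12 segments
summary(fit)                       # reports the BIC-preferred model and segment sizes
segments <- fit$classification     # modal segment assignment for each respondent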
It must be noted that latent class clustering is not the same as latent class choice modeling.
Latent class clustering is equivalent to trying to minimize a distance metric among observations
within a segment while maximizing the differentiation between segments. It is applied to the
estimated HB choice utilities like it would be to any other set of numbers, without any
knowledge of what they are or where they came from. Latent class choice modeling imposes a
model structure on the creation of segments, in this case, the MNL model itself. It finds a set of
segments that maximizes the differences in utilities across segments given the model structure.
Because latent class clustering has no model structure equivalent to latent class choice modeling,
this leads to differences in the results.
CASE 1: HIGHLY DIFFERENTIATED—RAW RESULTS
Using the usual BIC (Bayesian Information Criterion) to determine the best number of
segments, the optimal solution for the latent class MNL model produced 4 segments. The
centroids of the derived segments are almost exactly the same as those used to generate the data.
Because the locations are hidden in Figure 4, the actual coordinates are given in Table 3.
Figure 4 depicts the actual MNL utilities, the derived HB MNL utilities, the centroids of the 4
true segments, the centroids for the 4 segment latent class clustering using the derived HB
utilities, and the centroids of the BIC-optimal 10 segments from the latent class clustering, all for
items 1 & 2. The larger black plus signs are the centroids of the latent class choice model
segments (these are not clearly visible so their values are provided in Table 3). The larger black
stars are the centroids for the BIC-optimal 10 segments from the latent class clustering on the
derived HB utilities. The smaller black crosses represent the locations of the HB derived
individual-level utilities. The smaller gray dots are the true underlying utilities.
Figure 4: Results from the latent class clustering of the derived HB utilities
Table 3: Derived segment utilities (centroids) for latent class MNL model

           Seg 1   Seg 2   Seg 3   Seg 4
Item 1      1.9    -3.7    -1.9     4.0
Item 2      3.5     2.0    -3.9    -1.7
Item 3     -4.1    -1.9     4.0     2.0
Item 4     -2.1     3.8     1.9    -3.7
Item 5      2.8     1.0    -2.9    -0.8
Item 6      0       0       0       0
Total       100     100     100     100
There are two things to notice in Figure 4: 1) there is an increase in the variance, or spread,
of the HB derived utilities (small black crosses) compared to the actual utilities (small gray
dots); and 2) correspondingly, the optimal 10 latent class clustering segment centroids (large
black stars) have spread out away from actual segment centroids (large black plus signs).
Also plotted are the centroids of the 4 segment solution from the latent class clustering on the
HB derived utilities (large black crosses). Interestingly, that 4 class solution produces perfect
classification of the respondents into the 4 underlying true segments (large black plus signs).
And, the general location of the 4 latent class clustering centroids is closer to the actual segments
than are the centroids of the BIC-optimal 10 segment solution. However, the researcher working
with this data would have no way of knowing the correct number of segments and would likely
begin their evaluation with far more segments than what actually generated these data.
There are several reasons why we see these results. The derived HB utilities are not scaled
into the same space as the actual utilities.1 This is likely the result of using the uninformative
priors that most practitioners assume when fitting MaxDiff or any choice model within HB.
There are also no upper model covariates trying to differentiate the segments. Finally, there is a
qualitative difference in the modeling approach. Latent class clustering is not the same thing as
latent class choice modeling which has the MNL model embedded in its derivation of segments.
CASE 2: HIGHLY DIFFERENTIATED—ALIGNED RESULTS
The latent class choice model segmentation produced the 4 known segments almost exactly.
The results are not different from Case 1.
The location and spread of the derived HB utilities are exactly the same between the Case 1:
Raw simulation and those of the Case 2: Aligned simulation (choice and non-choice data)—as
you might expect given the HB derived utilities are the same. The latent class clustering of the
HB derived utilities with the non-choice data produced a BIC-optimal 8-segment solution. The addition
of the non-choice data did improve the previous BIC-optimal solution: the new solution
collapsed 2 of the previous segments because of the non-choice data. These results are NOT
shown in a figure because the graphics do not reveal any additional insights beyond what we saw in
Case 1.
The latent class clustering with the non-choice data aligned one-to-one with the choice data
strengthens the 4 segment pattern of differentiation created purposively in the data generation
process by reducing the number of BIC optimal segments. As a result, the next set of results have
fewer optimal segments than using the derived HB utilities alone. All further Highly
Differentiated results, including those examining the impact of different ways of handling the
derived HB utilities, are presented using Case 2: Aligned with its non-choice data included.

1 In this data we could fit individual-level MNL MaxDiff models using classical aggregate MNL estimation. The utilities derived from this estimation are far more extreme than those we see from the HB MNL estimation! This is indicative of the regression-to-the-mean properties of the HB MNL model. The author can provide these estimates upon request.
To reduce the spread of the HB derived utilities we increased the prior degrees of freedom
from the recommended value of 5 to 400, the same value as the sample size. This “hammering
the priors” should reduce the spread in the HB derived utilities. Figure 5 shows that the spread
was indeed reduced. The number of optimal segments in the latent class clustering solution fell
from 10 to 5 and the segment centroids were much closer to the latent class MNL choice model
with aligned non-choice data segmentation centroids. Again, the 4 segment solution was spot on
in classification. The only difference now is that the 5 HB-based BIC-optimal segments split
segment 3 into 2 segments differing only with respect to extremes. A researcher’s examination of
the mean segment utilities would likely drive the research towards the 4 segment solution.
Figure 5: Highly Differentiated—Aligned; Priors Hammered
The problem with this result is that the researcher does not know when to “hammer the
priors.” Should we always do so? In our simulations, we have the unfair advantage of knowing
the actual utilities—a real-world practitioner does not. In addition, the hope that the non-choice
data is aligned as perfectly with the choice data as it is in Figure 5 is unrealistic. In the
real world, it is more likely that less alignment will be seen.
Another approach to segmenting choice utilities is to rescale the utilities. Several methods of
rescaling exist, including rescaling the utilities so that the range in utilities for each individual
respondent is a constant (e.g., Sawtooth Software recommends a range of 100), or exponentiating
the utilities and rescaling them to a function of the number of items in each task. In Figure 6
below the utilities have been rescaled to have a constant range of 10, which is more in line with
the actual utilities. In order to make the actual utilities comparable on the graphic, they were also
rescaled to the same range of 10. Keep in mind, however, that the practitioner would not know to
what range value to rescale.
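A minimal sketch of this range rescaling (hypothetical object names; each respondent's utilities are multiplied so that their maximum-minus-minimum range equals the chosen constant):

# hb_util is a respondents-by-items matrix of HB posterior mean utilities
rescale_range <- function(u, target = 10) u * target / (max(u) - min(u))
hb_util_rescaled <- t(apply(hb_util, 1, rescale_range, target = 10))  # one row per respondent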
The plots of the derived HB utilities are almost identical to the rescaled actual utilities. There
is a stretching of the points along different axes, but this is due to the nature of the simulation’s
generation of these utilities across all 6 items. Selecting different actual utilities would have led
to a more concentric pattern of utilities.
Figure 6: Highly Differentiated—Aligned; Rescaled Utilities (Range = 10)
These results have 7 segments based upon the BIC.2 The segments cluster around the 4 actual
segment centroids and only differentiate extremes within the clustering of derived HB utilities.
Ideally the researcher would notice this and decide to reduce the number of segments to the 4
obvious segments. As noted earlier, the aligned non-choice data is helping drive the number of
segments lower than in the choice-only data segmentation we saw earlier. However, the rescaled
HB utilities are resulting in segments that are differentiated only in terms of extremes (i.e., the
difference among the segments is only in terms of 1 or 2 items having a higher mean utility).
This might be exacerbated if we had rescaled the utilities to a range of 100, as that would
increase their effect on the clustering distance measures vs. the non-choice data. Rescaling the
utilities is important, but the rescaling value can significantly influence the number of segments.

2 Note the actual or predicted LC segments are not shown. Their locations have not substantially changed from Figure 5.
Profiling the segments on the choice and non-choice data makes this pattern clear. Figure 7
depicts the rescaled segment mean utilities for the 4 segment latent class choice model and the
rescaled segment mean utilities for the 7-segment latent clustering of the derived HB utilities. Clearly,
the HB segments are separated from one another only by extreme rescaled utilities on 1 or 2
items. A similar pattern arises in the comparison of the non-choice profiles as seen in Figures
8 and 9 below.
Figure 7: Highly Differentiated—Aligned; Rescaled Mean Segment Utilities
Figure 8: Highly Differentiated—Aligned; Profile of Nominal Non-Choice Variables
Figure 9: Highly Differentiated—Aligned; Profile of Continuous Non-Choice Variables
The boxes surrounding pairs of segments in the latent clustered HB-rescaled segmentation
are those most similar to each other. Examination of these segments reveals that the segments
paired together differ only in the magnitude of 1 or 2 non-choice variables and 1 or 2 MaxDiff
items.
CASE 3: HIGHLY DIFFERENTIATED—RANDOM RESULTS
In this simulation we randomly assigned the non-choice observations to the choice data
observations to see what would happen to the segmentation results. Both the latent class choice
model results and the latent class clustering of the HB derived utilities results suggest more
segments than before. The optimal number of segments for the latent class choice model
segmentation is 17; while the latent clustering of derived HB utilities suggests 12. In the latent
class choice model, all but one of the original 4 segments split into 4 smaller segments,
differentiated by the non-choice data (one segment split into 5 segments accounting for a few
outlier respondents differentiated by the utilities). In the clustered derived HB utilities
segmentation, 3 of the 4 original segments split into 3 sub-segments, while one split into 4
segments. These segments are also differentiated on the non-choice data, but not as perfectly as
the latent class choice model segments. Tables 4 and 5 show the cross tabulations of the original
utility generated segments by the BIC-optimal number of segments derived.
Table 4: Latent class choice model segment cross-tabulation (optimal solution as columns;
original segments as rows); Highly Differentiated—Random

Orig Seg 1 splits across optimal Segs 1–4:   25%, 18%, 26%, 31%
Orig Seg 2 splits across optimal Segs 5–8:   25%, 25%, 24%, 26%
Orig Seg 3 splits across optimal Segs 9–12:  20%, 19%, 32%, 29%
Orig Seg 4 splits across optimal Segs 13–17: 23%, 26%, 25%, 20%, 2%
Total (share of the sample in each of the 17 optimal segments):
6%, 5%, 7%, 8%, 6%, 6%, 6%, 7%, 5%, 5%, 8%, 7%, 6%, 6%, 7%, 6%, 5%
Table 5: Derived HB utilities latent cluster segmentation cross-tabulation (optimal solution
as columns; original segments as rows); Highly Differentiated—Random

Orig Seg 1 splits across optimal Segs 1–3:  51%, 31%, 18%
Orig Seg 2 splits across optimal Segs 4–6:  50%, 25%, 25%
Orig Seg 3 splits across optimal Segs 7–9:  50%, 48%, 2%
Orig Seg 4 splits across optimal Segs 9–12: 1%, 51%, 25%, 23%
Size (share of the sample in each of the 12 optimal segments):
13%, 8%, 5%, 12%, 6%, 6%, 12%, 12%, 1%, 13%, 6%, 6%
This contrived pattern of results suggests the possible use of a joint basis, or joint objective,
latent class segmentation. Joint basis latent class segmentation would use two nominally scaled
latent variables for producing segmentation on two different sets of variables, in this case the
choice data and the non-choice data. This is equivalent to the joint segmentation work originally
proposed by Ramaswamy, Chatterjee and Cohen (1996) who simultaneously segment different
sets of basis variables allowing either for independent or dependent segments to be derived (by
allowing correlations among the nominally scaled latent variables). This is simple to accomplish
within the syntax framework of Latent Gold. Table 6 below depicts the cross-tabulation of results
from the joint latent class choice model segmentation using the choice and non-choice data as
separate nominally scaled latent variables. The pattern of segment assignment is nearly perfect.
Table 6: Latent class choice model using joint segmentation (choice and non-choice data—
2 latent variables)
Known (Choice Segs as rows, Non-Choice Segs as columns):
             Seg 1   Seg 2   Seg 3   Seg 4   Total
Seg 1          25      18      31      26     100
Seg 2          26      25      25      24     100
Seg 3          29      32      20      19     100
Seg 4          20      26      23      31     100
Total         100     101      99     100     400

Predicted w/ 2 LVs (Choice Segs as rows, Non-Choice Segs as columns):
             Seg 1   Seg 2   Seg 3   Seg 4   Total
Seg 1          25      18      31      26     100
Seg 2          26      24      26      24     100
Seg 3          29      31      21      19     100
Seg 4          20      27      22      31     100
Total         100     100     100     100     400
CASE 4: LESS DIFFERENTIATED—ALIGNED RESULTS
Figure 10 depicts the segment centroids for both the latent class choice model segmentation
and the derived HB utilities latent class clustering. In neither case was the actual number of 4
segments suggested as the optimal solution. Latent class choice modeling’s optimal solution
consists of 9 segments, whereas the derived HB utilities latent class clustering yielded 7
segments. Table 7 shows how the actual known 4 segment solution breaks out across the two
derived solutions.
Figure 10: Less Differentiated—Aligned; Derived segment centroids
Table 7: Less-Differentiated—Aligned;
Cross-tabulations of both segmentations against actual segments
Latent Class Choice Model (9 optimal segments as columns; row %)
Orig Seg 1 splits across optimal Segs 1–2: 40%, 60%
Orig Seg 2 splits across optimal Segs 3–4: 59%, 41%
Orig Seg 3 splits across optimal Segs 5–6: 48%, 52%
Orig Seg 4 splits across optimal Segs 7–9: 39%, 27%, 35%
Total (respondents per optimal segment): 40, 60, 59, 41, 48, 52, 39, 27, 35

Latent Clustering with Derived HB Utilities (7 optimal segments as columns; row %)
Orig Seg 1 splits across optimal Segs 1–2: 76%, 24%
Orig Seg 2 splits across optimal Segs 3–4: 57%, 43%
Orig Seg 3 falls entirely in optimal Seg 5: 100%
Orig Seg 4 splits across optimal Segs 6–7: 54%, 46%
Total (respondents per optimal segment): 76, 24, 57, 43, 100, 54, 46
Without going into the details of each segment's profile, the general conclusions that can
be drawn are that the less differentiated the choice data is: 1) the more segments either approach will
give us; 2) the murkier the distinctions between the segments become, with differences only in
the extremes; and 3) the more strongly the non-choice data influences the solution.
With these results, it is hard to make a concrete recommendation to use latent class choice
modeling incorporating non-choice data over the sequential approach of combining derived HB
utilities with non-choice data. Issues are raised, but the evidence is not clear. Conceptually, if one's goal
is segmentation, one should use segmentation methods without resorting to using the sequential
HB approach. However, this last simulation, which is more likely the situation we would see
using real world data, suggests that either approach might work.
REAL-WORLD EXAMPLE
A latent class choice model using non-choice data was estimated for a segmentation project
in which customers of international duty free shops were surveyed. The objective of the research
was to develop a global segmentation that would enable the suppliers of duty free shops to tailor
their products to the customers frequenting the shops. The client expected a large number of
segments, of which only a few might be of direct interest to a specific supplier (e.g., tobacco,
liquors, or electronics). The segmentation bases included actual shopping behaviors, benefits
derived from shopping (MaxDiff task 1), the desire for promotions (MaxDiff task 2), and
attitudes towards shopping.
An online survey was conducted among 3,433 travelers recruited at 28 international airports
around the world who had actually purchased something in a duty free shop on their last trip. The
questionnaire included the two MaxDiff tasks, and a set of bi-polar semantic differential scale
questions regarding attitudes towards shopping in general and duty-free shopping in particular.
Behavioral items included the types of trip (e.g., business vs. leisure; economy vs. first class), the
frequency of trips, their spending at duty-free shops, and socio-demographics.
A simultaneous, single latent variable, latent class choice model which included the non-choice data was estimated. A batch file to determine solutions with 2 to 25 segments was created
and took over 2 days to run. Many different runs were made with direct client involvement in the
evaluation of the interim solutions. Variables were dropped, added and transformed to tailor the
solution and drive toward a solution the client felt was actionable. This process took over 3
weeks. In the final run, the BIC-optimal configuration had 13 segments; the client chose the 14
segment solution as the best from a business standpoint.
Figures 11, 12, and 13 summarize the final set of segments.
Figure 11: Real World Example Segment Summaries—Part 1
Figure 12: Real World Example Segment Summaries—Part 2
Figure 13: Real World Example Segment Summaries—Part 3
DISCUSSION
Segmentation is hard. One must be very careful when conducting segmentation. I strongly
believe if the main objective of the research is segmentation one should use segmentation
methods. Prior to the development of Latent Gold it was difficult for the practitioner to
simultaneously segment respondents on the basis of choice and non-choice data. Practitioners
could:
1) Segment solely using the choice data (preferably with a latent class MNL model) and then
profile the resulting segments on the non-choice data of interest;
2) Separately segment the choice and non-choice data and bring the results together, either using
ensemble techniques or by adding one segmentation assignment to the other subsequent
segmentation as an additional variable; or
3) Derive MNL utility estimates using HB methods and add the non-choice data to them before
performing segmentation analyses.
This paper demonstrated issues that arise using the third approach. The ideal approach is to
conduct a simultaneous segmentation using both choice and non-choice data together. Latent
Gold’s syntax enables the practitioner to conduct such analyses.3 While we recommend
simultaneously segmenting choice and non-choice data using latent class methods when the
objective is purely segmentation, we do not wish to suggest segmenting derived HB utilities is
"wrong." Rather, our goal was to show how using derived HB utilities within segmentation
without careful consideration of issues such as the setting of priors and the magnitude of
rescaling the utilities can affect the results one sees. But these issues are avoided when
conducting the segmentation simultaneously.

3 In addition to the Appendix of the paper, the author's website, www.eagleanalytics.com, provides some examples from the appendix of the presentation that show one how to set up such a segmentation, as well as the original presentation slides.
We also demonstrated that when one takes the approach of simultaneously segmenting choice
and non-choice data via Latent Gold’s syntax module, one can extend the segmentation in
several ways:
1) A joint segmentation on more than one set of basis variables can be performed. Using
more than one nominally scaled latent variable and allowing the correlation among the
latent variables to be estimated is the classic case of a joint basis (or joint objective)
segmentation;
2) One may include more than one choice task in the segmentation. The real-world example
segmented 2 MaxDiff tasks and non-choice data simultaneously into a common set of
segments described by a single nominally scaled latent variable; and
3) Although not demonstrated here, one may also extend the segmentation by including
MNL scale segments. Previous Sawtooth Software presentations (Magidson & Vermunt,
2007) have shown a segmentation approach to estimating relative scale differences
among segments of respondents, with those differences estimated appropriately according to the MNL
model. These scale segments may or may not be allowed to be correlated with the other
nominally scaled latent variables derived.
This paper and presentation was:
1) Not a bake-off between methods.
2) Not a software recommendation. With some effort, similar analyses may be conducted
within MPLUS;
3) Not about using ensemble analysis to evaluate multiple solutions (which could obviously
be applied to either case); and, lastly,
4) Not based upon the naïve simulations presented above: the recommendation to use the
simultaneous approach rests on satisfying the objectives of a segmentation with a
segmentation method that does not require sequential estimation.
When clear patterns of differentiation exist in data most any segmentation method will find
those patterns. However, when such clear differentiation does not exist every segmentation
method will find some solution—a solution influenced by the method itself and its
characteristics. This is the essence of one result in the research conducted by Dolnicar and Leisch
(2010) who state:
If natural clusters exist in the data, the choice of the clustering algorithm is
less critical . . .
If clusters are constructed, the choice of the clustering algorithms is
critical, because each algorithm will impose structure on the data in a
different way and, in so doing, affect the resulting cluster solution.
APPENDIX
Example of complete syntax for a single latent variable simultaneous choice/non-choice
model segmentation.
Thomas Eagle
REFERENCES
Dolnicar, S. and F. Leisch, (2010) “Evaluation of structure and reproducibility of cluster
solutions using the bootstrap,” Marketing Letters, Volume 21 (1), pp 83–101.
Magidson, J and J. Vermunt, (2002) "Latent class models for clustering: A comparison with K-means," Canadian Journal of Marketing Research, Volume 20, pp 37–44.
Magidson, J and J. Vermunt, (2007) “Removing the Scale Factor Confound in Multinomial Logit
Choice Models to Obtain Better Estimates of Preference,” 2007 Sawtooth Software
Conference Proceedings, pp 139–154.
Ramaswamy, V.R., R. Chatterjee and S.H. Cohen, (1996) "Joint segmentation on distinct
interdependent bases with categorical data," Journal of Marketing Research, Volume 33
(August), pp 337–55.
EXTENDING CLUSTER ENSEMBLE ANALYSIS VIA SEMI-SUPERVISED
LEARNING
EWA NOWAKOWSKA1
GFK CUSTOM RESEARCH NORTH AMERICA
JOSEPH RETZER
CMI RESEARCH, INC.
INTRODUCTION
Market segmentation analysis is perhaps the most challenging task market researchers face.
This is not necessarily due to algorithmic or computational complexity, data availability, lack of
methodological approaches, etc. The issue instead is that desirable market segmentation solutions
simultaneously address two issues:
• Partition quality and
• Facilitation of market strategy (actionability).
Various approaches to market segmentation analysis have attempted to address the multi-goal
issue described above (see, for example, “Having Your Cake and Eating It Too? Approaches for
Attitudinally Insightful and Targetable Segmentations” (Diener and Jones, 2009)). This paper
describes a completely new and effective approach to the challenge. Specifically we extend
cluster ensemble methodology by augmenting the ensemble partitions with those derived from a
supervised learning (Random Forest—RF) predictive model. The RF partitions incorporate
profiling information indicative of target measures that are most of interest to the marketing
manager. Segmentation membership is more easily identified on the basis of previously selected
attributes, behaviors, etc. of interest. Also, the consensus solution produced from the ensemble is
of high quality facilitating differentiation across segments.
1 Ewa Nowakowska, 8401 Golden Valley Rd, Minneapolis, MN 55427, USA, T: 515 441 0006 | [email protected]
Joseph Retzer, CMI Research Inc., 2299 Perimeter Park Drive, Atlanta, Georgia 30341, USA. T: 678 805 4013 | [email protected]
I. MARKET SEGMENTATION CHALLENGES
Figure 1.1 Standard Deck of Cards
Consider an ordinary deck of 52 playing cards shown in Figure 1.1. To illustrate the
challenges of market segmentation we may begin by asking the question:
“What is a high quality partition that might be formed from the deck?”
First, a few definitions:
Partition: In market segmentation a partition is nothing more than an identifier of group
membership, e.g., a column (vector) of numbers, one entry for each respondent in our dataset,
which identifies the segment to which our respondent is assigned.
High Quality: A high quality partition is one in which the groups identified are similar
(homogeneous) within a specific group while differing (heterogeneous) across groups.
A number of potential, arguably high quality, partitions may come to mind:
• Red vs. Black cards (2 segment partition)
• Face vs. Numeric cards (2 segment partition)
• Clubs vs. Diamonds vs. Spades vs. Hearts (4 segment partition)
• Aces vs. Kings vs. Queens vs. . . . (13 segment partition)
• Etc.
Given multiple potential partitions, the question becomes “Which one should we use?” In the
case of the example above we might answer, “It depends on the game you intend to play.” More
generally, the market researcher would respond, “We choose the partition that best facilitates
marketing strategy.” Addressing the fundamental and nontrivial dilemma illustrated by this
simple example is the focus of this paper.
II. HIGH QUALITY AND ACTIONABLE RESULTS
As noted earlier, high quality cluster solutions exhibit high intra-cluster similarity and low
inter-cluster similarity. Achieving high quality clusters is the primary focus of most, if not all,
commonly used clustering methods. In addition, numerous measures of quality also exist e.g.,
• Dunn Index
• Hubert's Gamma
• Silhouette plot/value
• Etc.
All of the above are in some way attempting to simultaneously reflect homogeneity within
and heterogeneity between segments. While various algorithms perform reasonably well in
achieving cluster quality, a new, computationally intensive ensemble approach, cluster ensemble
analysis, has been shown to produce high quality results on both synthetic and actual data.
Cluster ensembles also offer numerous additional advantages when applied to multi-faceted data
(see Strehl and Ghosh (2002), Retzer and Shan (2007)).
The second desirable outcome of a segmentation model, actionable results, is often more
difficult to achieve. The marketing manager cannot implement effective marketing strategy
without being able to predict membership in relevant groups, i.e., the segments must be
actionable. Actionability focuses on customer differences and similarities that best drive
marketing strategies. Marketing strategies might include such activities as:
• Developing brand messages appealing to different customer types,
• Designing effective ad materials,
• Driving sales force tactics and training,
• Targeted marketing campaigns, and
• Speaking or reacting to concerns of most desirable customers.
It is important to note that actionability is not necessarily achieved by insuring a high quality
partition. Consider an illustration in which a biostatistician produces a genome-based partition
that perfectly segments gene sequences identifying hair color. Assume the resultant clusters are
made up of, for example:
• Cluster 1: Blond hair respondents
• Cluster 2: Brown hair respondents
• Cluster 3: Red hair respondents
• Etc.
Further assume that there is no overlap in hair colors across the clusters. A question we might
ask is “does this represent a desirable partition?” Clearly the answer would be “no” if in fact the
biostatistician was engaged in cancer research.
It is worth noting that occasionally researchers may be tempted to simply include customer
demographics (or other actionable information) directly into the segmentation as basis variables.
This is undesirable for a number of reasons including:
• There is no reason to necessarily expect groups to form in such a way as to facilitate
identification of respondent membership in desired clusters. This may or may not happen
since we are focused only on simultaneous relationships between attributes as opposed to
relationships directly related to the prediction of a "market strategy facilitating" (e.g.,
purchasers vs. non-purchasers) outcome.
• The introduction of additional/many dimensions in a segmentation analysis, which
requires simultaneous similarity across ALL dimensions in the dataset, typically leads to
a relatively flat (similar) cluster solution of poor quality.
In order to ensure the inclusion of information that is directly related to the prediction of
relevant market strategy variables (e.g., Purchaser vs. Non-purchaser), we turn to an analytical
approach well suited to prediction, Random Forest analysis.
III. MACHINE LEARNING TERMINOLOGY
Before going forward into more detail around the specifics involved in Semi-Supervised
Learning, a brief aside defining some commonly used machine-learning terminology is called
for. We provide a brief description of Unsupervised, Supervised and Semi-Supervised Learning
below.
Unsupervised Learning: Unsupervised learning is a process by which we learn about the
data without being supervised by the knowledge of respondent groupings. The data is referred to
as “unlabeled,” implying the groupings, whatever they may be, are latent. Common cluster
analysis algorithms are examples of unsupervised learning. There are many examples of such
algorithms including, e.g.,
• Hierarchical methods
• Non-hierarchical: k-means, Partitioning Around Medoids (PAM), etc.
• Model Based: Finite Mixture Models
• Etc.
Supervised Learning: Supervised learning on the other hand, is the process in which we
learn about the data while being supervised by the knowledge of the groups to which our
respondents belong. In supervised learning external labels are provided (e.g., purchaser vs. nonpurchaser). Supervised learning algorithms include such as methodologies as, e.g.,
•
•
•
•
•
Classification And Regression Trees (CART)
Support Vector Machines (SVM)
Neural Networks (NN)
Random Forests (RF)
Etc.
Semi-Supervised Learning: Semi-Supervised Learning involves learning about data where
complete or partial group membership in market strategy facilitating segments is known. In
essence semi-supervised learning combines aspects of both unsupervised and supervised
learning. The implementation of semi-supervised learning models may be performed in a variety
of ways. An example is found in “A Genetic Algorithm Approach for Semi-Supervised
Clustering” by Demiriz, Bennett and Embrechts. While this illustration employs partially labeled
data, requiring an additional intermediate step to assign labels to all data points, the underlying
approach is conceptually the same as is found in this presentation. Our implementation of semi-supervised learning is achieved via cluster ensemble analysis.
IV. CLUSTER ENSEMBLES (UNSUPERVISED LEARNING)
Cluster ensembles or consensus clustering analysis is a computationally intense data mining
technique representing a recent advance in unsupervised learning analysis (see Strehl and Ghosh
(2002)).
Cluster Ensemble Analysis (CEA) begins by generating multiple cluster solutions using a
collection of “base learner” algorithms (e.g., PAM (Partitioning around Medoids)), finite mixture
models, k-means, etc.). It next derives a “consensus” solution that is expected to be more robust
and of higher quality than any of the individual ensemble members used to create it. Cluster
Ensemble solutions exhibit low sensitivity to noise, outliers and sampling variations. CEA
effectively detects and portrays meaningful clusters that cannot be identified as easily with any
individual technique. CEA has been suggested as a generic approach for improving the quality
and stability of base clustering algorithm results (high quality cluster solutions exhibit high intra-cluster similarity and low inter-cluster similarity).
CEA provides a framework for incorporating numerous unsupervised learning ensemble
members (generated via PAM, k-means, hierarchical, etc.) as well as augmentation of the
unsupervised ensemble with partitions reflecting supervised learning analysis.
V. RANDOM FOREST ANALYSIS (SUPERVISED LEARNING)
The supervised learning information incorporated into the ensemble is provided by Random
Forest (RF) analysis. RF analysis is a tree-based, supervised learning approach suggested by
Breiman (2001). It represents an extension of “bagging” (bootstrap aggregation), also suggested
by Breiman (1996). RFs are algorithmically similar to bagging; however, additional randomness
is injected into the set of estimated trees by evaluating, at each node, a subset of potential split
variables selected randomly from the eligible group.
An intuitive description of RF analysis may be given as follows. Consider Figure 5.1
below.
Figure 5.1 Random Forests
Just as a single tree may be combined with many trees to create a forest, a single decision
tree may be combined with many decision trees to create a statistical model referred to as a
Random Forest. Categorical variable prediction in RF analysis is accomplished through simple
majority vote across all forest tree members. That is to say, a respondent's data may be "dropped"
through each of the RF trees and a count made as to the number of times the individual is
classified in any of the categories (by virtue of the respondent's position in each tree's terminal
nodes). Whichever category the respondent falls into most often is the winner and the respondent
is subsequently classified in that group.
Empirical evidence suggests Random Forests have higher predictive accuracy than a single
tree while not over-fitting the data. This may be attributed to the random predictor selection
mitigating the effect of variable correlations in conjunction with predictive strength derived from
estimating multiple un-pruned trees. RFs offer numerous advantages over other supervised
learning approaches, not the least of which is markedly superior out-of-sample prediction. Other
advantages of RF analysis include:
• Handles mixed variable types well.
• Is invariant to monotonic transformations of the input variables.
• Is robust to outlying observations.
• Accommodates several strategies for dealing with missing data.
• Easily deals with a large number of variables due to its intrinsic variable selection.
• Facilitates the creation of a respondent similarity matrix.
VI. COMBINING SUPERVISED WITH UNSUPERVISED LEARNING (SEMI-SUPERVISED
LEARNING)
While it is clear that RF analysis is well suited to supervised learning and predictive analysis,
what may be less clear is how can it be used in the context of cluster ensembles to facilitate
semi-supervised learning. This process is described below.
As a first step, an n x n (where n is the total number of respondents in our study) null matrix,
SRF, is created, and T is defined as the total number of trees. Next, for each tree in the RF analysis
(typically around 500 trees are created) holdout observations are passed through the tree and
respondent pairs are observed in terms of being together or otherwise, in the trees' terminal
nodes. Specifically,
• For each of T trees, if respondent i and respondent j (i ≠ j) both land in the same terminal
node, increase the i, jth element of SRF by 1.
• The final matrix, SRF, is a count, for every possible respondent pair, of the number of
times each landed in the same terminal node.
The resultant matrix then may be considered a similarity matrix of respondents based on their
predictive information pertaining to selected marketing strategy facilitating measures.
Similarity matrices may be used to create cluster partitions. These partitions in turn may be
included in ensembles that also contain partitions based on purely unsupervised learning
analysis. It is in this way we are able to combine latent with observed group membership
information to produce a semi-supervised learning consensus partition.
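A brief R sketch of this construction (survey_df, predictor_vars, and the purchaser flag are all hypothetical names): randomForest()'s built-in proximity, restricted to out-of-bag cases, plays the role of the matrix described above, scaled by the number of trees.

library(randomForest)

set.seed(42)
rf <- randomForest(x = survey_df[, predictor_vars],    # e.g., profiling/behavioral items
                   y = survey_df$purchaser,            # market-strategy target (a factor)
                   ntree = 500,
                   proximity = TRUE, oob.prox = TRUE)  # pair counts based on out-of-bag cases

prox <- rf$proximity        # n x n matrix: share of trees in which a pair co-occurs
s_rf <- prox * rf$ntree     # approximate count version, analogous to SRF above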
It is important to note that we employed Sawtooth Software’s Convergent Cluster Ensemble
Analysis (CCEA) package to arrive at a consensus clustering. Not only does this facilitate the
creation of an unsupervised ensemble but in addition adds the stability afforded by a convergent
solution into the final result.
VII. SHOWCASE
1. The data and the task
This section presents the results obtained when applying the proposed method to real data.
The data described general attitudes to technology with special focus on mobile phones. The
primary goal was to see how the respondents segment on attitudes towards mobile phones, so the
attitudinal statements were considered the basis variables. An accurate cluster assignment tool
was also required, however, it was supposed to be based not on the category-specific statements
such as the attitudes but on more general questions such as lifestyle. Such general questions can
be asked in virtually any questionnaire, making the segments detectable regardless of the specific
subject of the survey. Hence, the basis and future predictive variables differed. Last but not least,
the segments were expected to profile well on behavioral variables, which in this case was
mobile phone usage, of key importance for segments’ understanding. This block of statements is
later referred to as additional profiling variables. Figure 7.1 shows the summary of the data
structure.
Figure 7.1 The data structure
As noted earlier, there are several reasons why inserting all these variables alike in a standard
segmenting procedure is likely to be unsuccessful. In addition, for this data set, behavioral
variables tend to dominate the segmenting process, producing very distinct yet hardly operational
partitions. Also, long lists of variables are typically not recommended, as the distinctive power
spreads roughly evenly among multiple items, resulting in solutions of flat and hardly
interpretable profiles.
2. The analysis
The analysis was conducted in three steps, where the first two can be considered parallel.
First, Sawtooth Software’s Convergent Cluster Ensemble Analysis (CCEA) was performed,
taking the basis variables as the input. In this case default settings for ensemble analysis were
used. This produced not only the CCEA consensus solution, which was later used for
comparisons, but also a whole ensemble of partitions, which constitutes one of two pillars of the
method.
Second, Random Forests (RF) were used to predict the additional profiling variables with the
set of predictive variables. In this case, mobile phone usage was explained by lifestyle
statements. For each profiling variable there was a single Random Forest and each produced a
single similarity matrix, referred to also as a proximity matrix. In this case there were multiple
additional profiling variables of interest, hence there were also multiple similarity matrices. Each
similarity matrix was then used to partition the observations. For each of them combinatorial
(Partitioning Around Medoids (PAM)) as well as hierarchical (average linkage) clustering
algorithms were used, and the number of clusters was varied within the range of interest, which
was from 3 to 7 groups. This diversity was desired, as typically more diverse ensembles lead to
richer and more stable consensus solutions.
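A sketch of how each such proximity matrix could be turned into ensemble partitions (object names are hypothetical): the dissimilarity is taken as 1 minus the proximity, and both PAM and average-linkage hierarchical clustering are run for 3 to 7 clusters.

library(cluster)

d <- as.dist(1 - prox)                        # prox: RF proximity matrix for one usage variable
rf_partitions <- list()
for (k in 3:7) {
  rf_partitions[[paste0("pam_", k)]]     <- pam(d, k = k, diss = TRUE)$clustering
  rf_partitions[[paste0("avglink_", k)]] <- cutree(hclust(d, method = "average"), k = k)
}
# rf_partitions would then be exported and appended to the CCEA ensemble.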
Altogether, it produced a large ensemble of RF-based partitions, which were then merged
with the original CCEA ensemble. The Convergent Cluster Ensemble Analysis was performed
again, now producing the semi-supervised CCEA consensus solution. The overview of the
process is presented in Figure 7.2.
Figure 7.2 The analytical process
The Random Forest analysis was programmed in R, primarily with the ‘randomForest’
package. The ensemble clustering was done with Sawtooth Software’s Convergent Cluster
Ensemble Analysis, which facilitated the whole process to a great extent. Ensemble clustering
can also be done in R with the 'clue' package, which offers a flexible and extensible computational
environment for creating and analyzing cluster ensembles; however, most of the process must be
explicitly specified by the user, making it more demanding and time consuming. In Sawtooth
Software's CCEA the user chooses between standard CCA (Convergent Cluster Analysis, via K-means) and CCEA (Convergent Cluster Ensemble Analysis). Selecting the latter, one needs to
make the key decision of how the ensemble should be constructed. It can be done in the course of
the analysis, which we exploited in the first step of the process, but it can also be obtained from
outside sources, which we used in the last step extending the standard CCEA ensemble by the
RF-based partitions. The ultimate consensus solution was selected based on the quality report
including reproducibility statistics.
3. The results
Quality comparison. The semi-supervised CCEA (SS-CCEA) consensus solution was
compared to the standard attitude-based CCEA consensus solution obtained in the first step of
the analysis. The reported reproducibility equaled 72% for the standard CCEA and 86% for its
semi-supervised extension SS-CCEA, which was a relatively high result for both groupings
given the number of clusters and statements. The difference in favor of SS-CCEA is possibly due
to its more diverse and larger ensemble; however, it tells us that extending the ensemble with the
RF-based partitions tends to increase reproducibility rather than spoil what was originally
a good CCEA solution.
Figure 7.3 Transition matrix (rows: CCEA segments; columns: SS-CCEA segments; cell counts)

           Seg 1   Seg 2   Seg 3   Seg 4   Seg 5
Seg 1        448       6      13     111      60
Seg 2          6     432     125       8      77
Seg 3         44      67     383     131     192
Seg 4        158      25      50     337     103
Seg 5         64      80      99      47     215
Attitudinal (basis) variables. The matrix in Figure 7.3 is a transition matrix, which shows
the size of the intersection between the original CCEA and the SS-CCEA segments. The possible
issue of label switching was taken care of and hence the diagonal represents the observations
that remained in the original segments after the semi-supervised modification. One can see that
the diagonal dominates, which indicates that both solutions have very much in common.
Therefore there are grounds to believe the semi-supervised part only slightly modified the
original CCEA solution rather than producing an entirely new partition.
Figure 7.4 Absolute profiles for CCEA and SS-CCEA solutions
(Radial plots of the segment profiles on the attitudinal items att_1 through att_16, one panel for the CCEA solution and one for the SS-CCEA solution.)
This is confirmed by how similarly the segments profile on the basis (attitudinal) variables.
This can be observed in the radial plots in Figure 7.4 showing the absolute profiles, with shades
indicating segments. Hence altogether, given that the CCEA consensus solution was of high quality,
the SS-CCEA solution is of high quality as well.
Usage (additional profiling) variables. Let us now compare the behavior of the additional
profiling variables, which was the category usage the segments were expected to profile on. To
measure this, we used Friedman’s importance (relevance) as described in the work of Friedman
and Meulman (2004). Intuitively, for each given variable and each given segment, it can be
thought of as the ratio of variance over all observations with respect to the variance within the
segment. Technically, the coefficient is defined in terms of spread, which is more general than
variance and also uses an additional normalizing factor, but the idea remains. So, the larger the value,
the more important the variable and the better the grouping profiles on it.
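A simplified, variance-based version of this coefficient can be sketched as follows (the published measure uses a more general notion of spread and an extra normalizing factor; usage_df and ss_ccea_segments are hypothetical objects):

friedman_like_importance <- function(x, seg) {
  sapply(sort(unique(seg)), function(k) {
    # for each variable: variance over all respondents / variance within segment k
    apply(x, 2, function(v) var(v) / var(v[seg == k]))
  })
}
# imp <- friedman_like_importance(usage_df, ss_ccea_segments)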
Figure 7.5 Additional profiling variables: Friedman’s importance
(Five bar charts, one per segment (Cluster 1 through Cluster 5), showing the importance values on a scale of roughly 0.0 to 2.0.)
The charts of Figure 7.5 show the values of the coefficient for the most important attributes
in the segments. Each chart corresponds to a segment and each bar to an attribute. The most
important attributes differ across the segments, which is indicated by varied shades of the
corresponding bars. The light sub-bars stacked on top indicate the increase in importance due to
the semi-supervised part. So, for instance, the first attribute in the second segment is very
important and the RFs have virtually no effect on that. However, the same attribute is also the most
important in the third cluster, and in this case the RFs still increase its importance substantially.
Generally, we can observe that the semi-supervised modification increased the importance of
some of the attributes, so altogether the SS-CCEA solution profiles better on usage than the
standard CCEA partition did.
Lifestyle (predictive) variables. Finally, let us examine the predictive performance of the
extended SS-CCEA segmentation, that is, the ability to assign each newly arriving observation to
its segment with sufficient accuracy. To build the cluster assignment model, the RF approach was
used again. So another Random Forest was built, to predict cluster membership. To assess its
performance the RF’s intrinsic mechanism for unbiased error estimation was exploited. In a
Random Forest each tree is constructed using a different bootstrap sample. So the cases that are
left out—the holdout sample, which is here called out-of-bootstrap (bag) (OOB)—are put down
the tree for classification and used to estimate the error rate. This is called the OOB error rate.
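A small R sketch of this typing-tool model and its OOB error tracking (lifestyle_df and ss_ccea_segments are hypothetical objects):

library(randomForest)

typing_rf <- randomForest(x = lifestyle_df, y = factor(ss_ccea_segments), ntree = 100)
typing_rf$err.rate[, "OOB"]   # OOB error rate after 1, 2, ..., 100 trees (cf. Figure 7.6)
tail(typing_rf$err.rate, 1)   # final overall and per-segment OOB error rates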
Figure 7.6 OOB error rates
Figure 7.6 presents the OOB error rates for both solutions as well as their change with the
increase in the number of trees in the forest. On the x-axis the number of trees in the forest is
given, varying from 0 to 100. The y-axis shows the OOB error rates, the j-th element showing
the result for the forest up to the j-th tree. The chart not only shows that in terms of the
classification error rates the SS-CCEA tends to outperform the standard CCEA but it also
reminds us of the role of the size of the forest, as up to a certain point the error rates drop
substantially with the increase in the number of trees. In brackets the OOB error rates for the
entire forest (i.e., 100 trees) are given. This means that for SS-CCEA the cluster membership can
be predicted with almost 85% accuracy, which is a very good result.
What happens often though, is that although the overall classification error rate might
decrease, it decreases only on average. In other words, this means that while some of the
segments are detected more accurately, for some others we observe a decrease in the classification
accuracy. However, this is not the case here. The chart of Figure 7.7 pictures the values in the
table and shows that the error rates for all the segments consistently drop. So not only the overall
error rate decreases but it also decreases for each cluster.
Figure 7.7 OOB error rates per segment
           Segm 1   Segm 2   Segm 3   Segm 4   Segm 5
CCEA         0.19     0.27     0.30     0.32     0.34
SS-CCEA      0.09     0.14     0.16     0.15     0.25
4. Profiling & Prediction
In virtually any segmentation study, an important goal is to enable the marketing manager to
profile respondents and interactively predict membership of new individuals into derived
clusters. RF analysis is employed for both purposes.
RF analysis provides powerful diagnostics and descriptive measures for example,
• Evaluating attribute importance.
• Predicting new respondent cluster membership, etc.
• Profiling cluster solutions via partial dependence plots.
One profiling capability, importance measurement, is briefly described below.
In order to calculate importance for a given attribute of interest, its value is randomly
permuted in each holdout (aka OOB, Out-Of-Bag) sample. Next the OOB sample data is run
through each tree and the average increase in prediction error across all T trees is calculated as:
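In generic notation (not taken from the original), this permutation measure for an attribute \(x_j\) can be written as

\[
\mathrm{VI}(x_j) \;=\; \frac{1}{T}\sum_{t=1}^{T}\left(\mathrm{err}_t^{\,\mathrm{OOB},\ x_j\ \mathrm{permuted}} \;-\; \mathrm{err}_t^{\,\mathrm{OOB}}\right),
\]

where \(\mathrm{err}_t^{\,\mathrm{OOB}}\) is tree t's prediction error on its out-of-bag sample and the first term is the same error after permuting \(x_j\) within that sample.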
This measure of average increase in predictive error also serves as the attribute’s importance
measure. The reasoning is straightforward: if a variable is important to the prediction of the
dependent measure, permuting its values should have a relatively large, negative impact on
predictive performance.
The charts of Figure 7.8 show the two basic measures of importance. Mean Decrease
Accuracy is described above, while Mean Decrease Gini captures the decrease in node impurities
resulting from splitting on the variable, also averaged over all trees. This information was used to
select the most important statements to employ in the typing tool described below.
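As a sketch under the same hypothetical object names, both measures shown in Figure 7.8 can be obtained directly from the fitted randomForest object (the permutation-based measure requires importance = TRUE at fit time):

library(randomForest)

typing_rf <- randomForest(x = lifestyle_df, y = factor(ss_ccea_segments),
                          ntree = 100, importance = TRUE)
importance(typing_rf)[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
varImpPlot(typing_rf)   # dot charts analogous to Figure 7.8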
Figure 7.8 Importance of the explanatory variables for the Random Forest model
(Two dot charts ranking the lifestyle items (e.g., Q1X26, Q1X27, Q1X5, Q1X29) by MeanDecreaseAccuracy and by MeanDecreaseGini.)
Prediction in RF analysis has been described previously as a straightforward majority vote
across multiple decision trees in a Random Forest. A practical consideration, however, involves
how such an algorithm may be implemented for use by the marketing manager.
The above task was accomplished in a very effective manner by construction of an
interactive, web browser based interface (which may be locally or remotely hosted) allowing
access to an R statistical programming language object. More specifically, by utilizing the
capabilities afforded in the R package “shiny” we were able to construct a powerful yet easy to
use interface to a Random Forest predictive object.
In addition to prediction of “relative” cluster membership, the tool also allowed for
identification of attribute levels leading to maximal probability of membership in any one of the
selected groups. These attribute levels are identified via application of a genetic algorithm search
in which specific class membership served as the optimization target. A screenshot of the
predictive tool is shown in Figure 7.9.
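A minimal sketch of such an interface (not the authors' tool; the input items, their scales, and the typing_rf object are assumptions), using the shiny package to expose predict() on the Random Forest object:

library(shiny)
library(randomForest)

ui <- fluidPage(
  sliderInput("Q1X26", "Q1X26 (1 = disagree, 7 = agree)", min = 1, max = 7, value = 4),
  sliderInput("Q1X27", "Q1X27 (1 = disagree, 7 = agree)", min = 1, max = 7, value = 4),
  tableOutput("probs")
)

server <- function(input, output) {
  output$probs <- renderTable({
    newdata <- data.frame(Q1X26 = input$Q1X26, Q1X27 = input$Q1X27)
    # typing_rf is assumed to have been fit on exactly these two inputs
    predict(typing_rf, newdata = newdata, type = "prob")
  })
}

shinyApp(ui, server)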
Figure 7.9 Segment prediction tool
VIII. SUMMARY
In this paper we extend the ensemble methodology in an intelligent and practical way to
improve the consensus solution. We employ Sawtooth Software’s CCEA (“Convergent” Cluster
Ensemble Analysis) so as to both improve consensus partition stability and facilitate ease of
estimation. The set of ensemble partitions are augmented with partitions derived from the
supervised learning analysis, Random Forests. Supervised learning partitions are created from a
similarity matrix based on Random Forest decision trees. These partitions are particularly useful
in that they incorporate profiling information directly indicative of analyst-chosen target
measures (e.g., purchaser vs. non-purchaser).
We compare/contrast Semi-Supervised Convergent Cluster Ensembles Analysis (SS-CCEA)
with alternate solutions based on both cluster profiles and out-of-sample post hoc prediction.
Post hoc prediction of cluster membership improves uniformly across clusters, while cluster
profiles show minimal changes.
Finally, we implement an interactive, browser based cluster simulation tool using the R
“shiny” package. The tool enables direct access to the Random Forest object, which in turn
produces superior predictive results. The tool may be hosted either locally or remotely allowing
for greater flexibility in deployment.
Our approach may be referred to as “semi-supervised learning via cluster ensembles” or, in
this case “Semi-Supervised Convergent Cluster Ensembles Analysis” (SS-CCEA).
Importantly, this paper highlights the ease of performing the analysis through the use of
Sawtooth Software’s CCEA package. CCEA both empowers the practitioner to efficiently
perform ensemble analysis and allows for simple augmentation of the ensemble
with externally derived partitions (in this case produced from the supervised learning, RF, results).
Ewa Nowakowska
Joseph Retzer
IX. REFERENCES
L. Breiman, Bagging predictors, Machine Learning, 24:123–140, 1996.
L. Breiman, Random forests, Machine Learning, 45(1): 5–32, 2001.
A. Demiriz, K. Bennett and M. Embrechts, A Genetic Algorithm Approach for Semi-Supervised Clustering, Journal of Smart Engineering System Design, 4: 35–44, 2002.
C. Diener and U. Jones, Having Your Cake and Eating It Too? Approaches for Attitudinally
Insightful and Targetable Segmentations, Sawtooth Software Conference Proceedings, 2009.
J. Friedman and J. Meulman, Clustering Objects on Subsets of Attributes, Journal of the
Royal Statistical Society: Series B, 66 (4): 815–849, 2004.
J. Retzer and M. Shan (2007), “Cluster Ensemble Analysis and Graphical Depiction of
Cluster Partitions,” Proceedings of the 2007 Sawtooth Software Conference.
A. Strehl and J. Ghosh, Cluster ensembles—a knowledge reuse framework for combining
multiple partitions, Journal of Machine Learning Research, 3: 583–617, 2002.
THE SHAPLEY VALUE IN MARKETING RESEARCH:
15 YEARS AND COUNTING
MICHAEL CONKLIN
STAN LIPOVETSKY
GFK
We review the application of the Shapley Value to marketing research over the past 15 years.
We attempt to provide a comprehensive understanding of how it can give insight to customers.
We outline assumptions underlying the interpretations so that attendees will be better equipped to
answer objections to the application of the Shapley Value as an insight tool.
Imagine it is 1998. My colleague Stan Lipovetsky is working on a TURF analysis (Total
Unduplicated Reach and Frequency) for product line optimization. Stan, being new to marketing
research, asked the obvious question—“What are we trying to do with the TURF analysis?”
TURF is a technique that was first used in the media business to understand which
magazines to place an advertisement in. The goal was to find a set of magazines that would
maximize the number of people who would see your ad (unduplicated reach) as well as
maximizing the frequency of exposure among those who were reached. This was adapted for
marketing research for use in product line optimization. Here, the idea was to find a set of
products to offer in the marketplace such that you would maximize the number of people who
would buy at least one of those products. The general procedure at the time was to ask
consumers to give a purchase interest scale response for each potential flavor in a product line.
Then the TURF algorithm is run to find the pair of flavors that maximizes reach (the number of
people who will definitely buy at least one product of the two), the triplet that maximizes reach,
the quad that maximizes reach, and so on. TURF itself is an NP-hard problem. To be sure you
have found the set of n products that maximizes reach, you must calculate the reach for all
possible sets of n.
Stan looked at the calculations we were doing for the TURF analysis and said, “This reminds
me of something I know from game theory, the Shapley Value.” “So, what is the Shapley Value?”
I asked. And so began a 15-year odyssey into the realm of game theory and a single tool that has
turned out to be very useful in a variety of situations.
THE SHAPLEY VALUE
Shapley first described the Shapley Value in his seminal paper in 1953 (Shapley, 1953). The Shapley Value
applies to cooperative games, where players can band together to form coalitions, and each
coalition creates a value by playing the game. The Shapley Value allocates that total value of the
game to each player. By evaluating over all possible coalitions that a player can join in, a value
for each specific player can be derived.
Formally the Shapley Value for player i is defined as:
$$\phi_i \;=\; \sum_{S \subseteq N,\ i \in S} \gamma_n(s)\,\bigl[\nu(S) - \nu(S \setminus \{i\})\bigr]$$
So, summing across all possible subsets of players S, the value of player i is the value of the
game for a subset containing player i minus the value of that same subset of players without
player i. In other words, it is the marginal value of adding the player to any possible set of other
players. The summation is weighted by a factor that reflects the number of subsets of a particular
size (s) that are possible given the total number of players (n).
where $\gamma_n(s) = \dfrac{(s-1)!\,(n-s)!}{n!}$.
When we apply the concept to the TURF game we have a situation where we create all
possible sets of products, and calculate the “value” of each set by determining its “reach,” or the
percent of consumers in the study who would buy at least one item in the set. By applying the
Shapley Value calculation to this data we can allocate the overall reach of all of the items to the
individual items. This gives us a relative “value” of each individual product. The values of these
products add up to the total value of the game, or the reach, of all of the products.
The fact that we can apply this calculation to the TURF game doesn’t necessarily mean that it
is useful. And it certainly appears that the Shapley Value is an NP-hard problem as well. We need
to calculate the overall reach or value of every possible subset of products to even calculate the
Shapley Value for each product.
Fortunately, the TURF game corresponds to what is known in game theory as a simple game.
A simple game has a number of properties. In a simple game, the value of a game is either a 1 or
a 0. All players in a coalition or team that produce a 1 value have a Shapley Value of 1/r where r
is the number of players in the team that can produce a win. In the TURF context, a consumer is
reached by a subset of products. Those products all get a Shapley Value of 1/r where r is the
number of products that are in that subset. All other products get a Shapley Value of 0.
Another property of simple games is that they can be combined. In our TURF data, we treat
each consumer as being a simple game. To combine the simple games represented by the
consumers in our study, we calculate the Shapley Value for each product for each consumer and
then average across consumers.
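To make the mechanics concrete, here is a minimal R sketch of this calculation; the 0/1 matrix `reach` (one row per consumer, one column per product, 1 = the consumer is reached by that product) is a hypothetical input.

```r
# Shapley values for the TURF simple game: within each consumer, every product
# that reaches her receives 1/r (r = number of products reaching her) and all
# other products receive 0; the per-consumer values are then averaged.
turf_shapley <- function(reach) {
  r        <- rowSums(reach)                    # products reaching each consumer
  per_resp <- reach / ifelse(r == 0, Inf, r)    # 1/r for reached products, 0 otherwise
  colMeans(per_resp)                            # average across consumers
}

# Small worked example: 3 consumers, 3 products
reach <- rbind(c(1, 1, 0),
               c(0, 1, 0),
               c(0, 0, 0))     # the third consumer is not reached by any product
round(turf_shapley(reach), 3)  # values sum to the overall reach (2/3 of consumers)
```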
We solve the problem of how to calculate the Shapley Value for TURF problems by
considering the TURF game as a simple game. But we still are not sure what this “value”
represents. For this we need to look at the problem from a marketing perspective.
A SIMPLE MODEL OF CONSUMER BEHAVIOR
Consider this simple model of consumer behavior:
1. A consumer plans to buy in the category and enters the store.
2. She reviews the products available and identifies a small subset (relevant set) that have
the possibility of meeting her needs.
3. She randomly chooses a product from that subset.
Now clearly most of us are not explicitly using some random number generator in our heads
to choose which product to buy when we visit the store. Instead we evaluate the products
available and choose the one that maximizes our personal utility, that is, we choose the product
we prefer . . . at that moment. The product that will maximize our utility depends upon several
factors. One factor is the benefits that the particular product delivers. A second factor is the
benefits delivered by other competing products that are available in the store. Benefits delivered
are evaluated in the context of needs. If one has no need for a benefit then its utility is nonexistent. If one has a great need for a particular benefit then a product delivering that benefit will
have a high utility and a good chance of being the utility maximizing choice.
When we observe consumer purchases, for example by looking at data from a purchase
panel, one can see that the specific products available, and their benefits, stay relatively constant,
but nonetheless, consumers seem to buy different products on different trips to the store. This
would seem to indicate that the driver of choice is the degree to which a person’s needs change
from trip to trip. Hypothetically, we can map an individual’s needs to specific products that
maximize utility when that need is present. This means that if we can observe the different
products that a person purchases over some time period, then we can infer that those purchases
are a result of the distribution of need states that exist for that consumer.
If the distribution of need states for a specific consumer were such that the probability of
choosing each product in the relevant set was equal then the purchase shares of each product
would be the equivalent of the Shapley Value of each product. Therefore, we can think of the
Shapley Value calculation as a simple choice model, where the probability of choosing a
particular product is 0 for all products not in the relevant set and 1/r for all r products in the
relevant set.
An alternative to the Shapley Value calculation would be to estimate the specific probabilities
of choosing each product using a multinomial logit discrete choice model. If we can estimate the
probabilities of purchase for each product for each consumer, then this should be a superior
estimate of purchase shares since the probabilities estimated in this manner would not be
arbitrarily equal for relevant products and would not be uniformly zero for non-relevant
products. But, is it feasible, in the context of a consumer interview, to obtain enough choice data
to accurately estimate those probabilities of purchase, especially if the product space is large? In
addition, it is not possible in the course of a 20-minute interview to ask consumers to realistically
make choices across multiple need states.
APPLICATION OF THE SHAPLEY VALUE TO CONSUMER BEHAVIOR
If we weight the consumers in our study by the relative frequency of category purchase and
units per purchase occasion then the Shapley Value becomes directly a measure of share of units
purchased. This moves the Shapley Value from being an interesting research technique to being a
very useful business management tool.
Anecdotally, we understand that category managers at retailers obtain a ranked sales report
for their category and consider the items that make up the bottom 20% of volume to be
candidates for delisting or being replaced in the store. Since the Shapley Value provides an
estimate of the sales rate for each product (in any combination), we can create a more viable
recommendation for a product line. Instead of choosing products that maximize “reach,” we can
use a dual rule of maximizing reach subject to the restriction that no products in the line fall into
the bottom 20% of volume overall.
To effectively do this analysis, one needs to collect data a little differently from TURF. In a
typical TURF study one asks respondents to give some purchase interest measure to each of the
prospective products that would go in the product line. A consumer is counted as “reached” if she
provides a top-box response to the purchase interest question. The problem with this approach is
two-fold. First, the questioning procedure is very tedious, especially as the number of products in
your product line increases. For that very reason, competitive brands are not typically included.
But, competitive brands are critical. Those are the products you want to replace on the retailer’s
shelves. The Shapley Value analysis can show you which of the competitor’s products your
proposed line should displace, but it can only do so if you have included the competitive
products in your study.
Our suggestion is to ask respondents which products, from the category, they have purchased
in some limited time period. (The time period should be dependent on the general category
frequency of purchase). This data can be used to calculate Shapley Values and optimize a product
line if all we are considering are existing products in the marketplace.
When considering new product concepts the problem is how to reliably determine if a new
product would become part of a consumer’s relevant set. This is especially problematic since
consumers are well known to overstate their interest in new product concepts. A method we have
found effective is to ask the typical purchase intent question for the new product and supplement
it by asking consumers to rank order the new concept amongst the other products they currently
buy (i.e., the ones selected in the previous task). We count a new product as entering an
individual consumer’s relevant set if, and only if, they rated it top box in purchase intent and
they ranked it ahead of all currently bought products. In our experience, this procedure appears
to produce reasonable estimates from the Shapley Value. (Since there is no actual sales data in
these cases a true validation has not been possible).
GOING BEYOND TURF—OTHER APPLICATIONS OF THE SHAPLEY VALUE
Recall that the Shapley Value is a way of allocating the total value of a game to the
participants in a fair manner. There are plenty of situations where we only know the total value
of something but we want to understand how that value can be allocated to the components that
create that value. One clear example is linear regression analysis. Here we want to understand
the value that each predictor has in producing the overall value of the model. The overall value
of the model is usually measured by the R2 value. Frequently we wish to allocate that overall R2
value to the predictors to determine their relative importance.
In 2000, my colleague Stan was working with one method of evaluating the importance of
predictors, the net effects. Net effects are a decomposition of the R2 defined as:
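In the notation spelled out in the next sentence (standardized coefficients β and predictor correlation matrix R), a standard form of this decomposition is

$$NE_j \;=\; \beta_j\,(R\beta)_j \;=\; \beta_j \sum_{k} r_{jk}\,\beta_k, \qquad \sum_j NE_j \;=\; \beta' R\, \beta \;=\; R^2,$$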
where the betas are vectors of standardized regression coefficients and R is the correlation
matrix of the predictor variables. The NE vector, when summed equals the R2 of the model. This
particular decomposition of R2 is problematic when there is a high degree of multicollinearity
amongst the predictors. In those cases there can often be a sign reversal in the beta coefficients
which can cause the net effect for that predictor to be negative. This makes the interpretation of
the net effects as an allocation of the total predictive power of the model illogical.
My experience with the Shapley Value caused me to wonder if the Shapley Value might be a
solution to this problem. The Shapley Value is an allocation of a total value. The individual
Shapley Values will therefore sum to that total value, and they will all be positive. We can easily
(although less easily than the line optimization case) calculate the incremental value of each
predictor across all combinations of predictors.
In the Shapley Value equation we substitute for the value term the R2 of each model:
$$\phi_i \;=\; \sum_{S \subseteq N,\ i \in S} \gamma_n(s)\,\bigl[R^2_{S} - R^2_{S \setminus \{i\}}\bigr]$$
This is no longer a simple game in the parlance of game theory, so it becomes an NP-hard
problem again. But for sets of predictors smaller than 30, it is a reasonable calculation on
modern computers.
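As a sketch of how this brute-force calculation can be carried out (hypothetical inputs: a predictor matrix X and a dependent variable y), the following R function enumerates all subsets and applies the subset-size weights:

```r
# Shapley-value decomposition of R^2 over all predictor subsets.
# Feasible only for modest numbers of predictors; X and y are hypothetical inputs.
shapley_r2 <- function(X, y) {
  p  <- ncol(X); vars <- colnames(X)
  r2 <- function(idx) {
    if (length(idx) == 0) return(0)
    summary(lm(y ~ ., data = data.frame(y = y, X[, idx, drop = FALSE])))$r.squared
  }
  phi <- setNames(numeric(p), vars)
  for (i in seq_len(p)) {
    others <- setdiff(seq_len(p), i)
    for (k in 0:length(others)) {
      w    <- factorial(k) * factorial(p - k - 1) / factorial(p)   # subset-size weight
      subs <- if (k == 0) list(integer(0)) else combn(others, k, simplify = FALSE)
      for (S in subs) phi[i] <- phi[i] + w * (r2(c(S, i)) - r2(S))
    }
  }
  phi  # the elements sum (up to rounding) to the full-model R^2
}

# Example with simulated data
set.seed(1)
X <- matrix(rnorm(300), ncol = 3, dimnames = list(NULL, c("x1", "x2", "x3")))
y <- as.vector(X %*% c(1, 0.5, 0) + rnorm(100))
round(shapley_r2(X, y), 3)
summary(lm(y ~ X))$r.squared
```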
I was convinced that this was an excellent idea. As is often the case with excellent ideas, it
turned out that there were many others doing research in other fields who had also come up with
essentially the same idea (Kruskal, 1987; Budescu, 1993; Lindeman, Merenda, & Gold, 1980). Many other related techniques also appear in the literature.
We did, however, take the approach one step further. Going back to the net effects
decomposition discussed earlier we realized that both of these techniques, Net Effects and
Shapley Value were trying to do the same thing: allocate the overall model R2 to the individual
predictors. So, if we assume that the Shapley Values are approximations of the Net Effects then
we can “reverse” the decomposition and calculate new beta coefficients so that they are as
consistent as possible with the Shapley Values.
This requires a non-linear solver but we can estimate a new set of beta coefficients that result
in Net Effects that are very close to the Shapley Values. These new coefficients can then be used
in a predictive model.
Gromping and Landau (2009) have criticized this approach. We show in a rejoinder (Lipovetsky & Conklin, 2010) that in
conditions of high multicollinearity, the model with the adjusted beta coefficients as described
above does a better job of predicting new data than the standard OLS model. We do recommend
only utilizing the adjusted coefficients in those extreme conditions.
Of course, there are other decompositions of R2 in the literature besides the Net Effects
decomposition. One decomposition, which was first described by Gibson (1962) and later rediscovered
by Johnson (1966), decomposes the R2 as follows:
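Using c for the vector of correlations between the predictors and the dependent variable, R for the predictor correlation matrix, and ω = R^{-1/2}c, a form consistent with the description below is

$$R^2 \;=\; c' R^{-1} c \;=\; \bigl(R^{-1/2}c\bigr)'\bigl(R^{-1/2}c\bigr) \;=\; \omega'\omega .$$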
This produces two identical vectors of weights ω that when squared, sum to the R2 of the
model. These can be interpreted as importance weights and are very close approximations to the
Shapley Values. The advantage of using this approximation of the Shapley Values for importance
is that this particular decomposition is not an NP-hard problem like the Shapley Value calculation
and therefore is much easier to compute with large numbers of predictors.
MOVING ON FROM LINEAR REGRESSION—OTHER ALLOCATION PROBLEMS
One of the nice things about the Shapley Value is that the “value function” is abstract. You
can define value in any way that you want, turn the Shapley Value crank and output an allocation
of that value to the component parts.
Consider the customer satisfaction problem. The Kano theory of customer satisfaction (Kano, Seraku, Takahashi, & Tsuji, 1984)
suggests that different product benefits have different types of relationships to overall
satisfaction.
Graphic by David Brown—Wikipedia
Identifying attributes that are “basic needs” or “must-be” attributes is critical in customer
satisfaction research. These are the items that cause overall dissatisfaction if, and only if, you fail
to deliver. The interesting thing about these attributes is that they are non-compensatory, that is,
if you fail to deliver on any one of these attributes you will have overall dissatisfaction, no matter
how well you perform on other attributes.
Standard linear regression driver model approaches clearly don’t work here. There are two
issues: first, a linear regression model is inherently compensatory, and second, the vast majority
of the data is located in the upper right quadrant of the graph above.
As a result, we construct a model like this:
First, let $D_A, D_B, \ldots, D_K$ represent the sets of customers dissatisfied with items A, B, C, . . . , K respectively.
Also let $D_O$ represent the set of customers dissatisfied overall.
We want to find a set of items such that

$$D_A \cup D_B \cup D_C \;\subseteq\; D_O .$$

In other words, dissatisfaction with A or B or C implies dissatisfaction overall.
One way of evaluating this is by calculating the reach into $D_O$, in other words, the percent of
overall-dissatisfied people who are dissatisfied with any item in the set. But this cannot be the end of
the calculation, because we need to subtract from this the percent of people who are satisfied
overall but are dissatisfied with one of the items in the set. In other words, we need to
subtract the false positive rate. This statistic is known as Youden's J (Youden, 1950), and we can use it to
evaluate any dissatisfaction model of the form noted above.
In our case, we treat Youden’s J statistic as the “value” of the set of items. We can search for
the set of items that maximizes Youden’s J and then use the Shapley Value calculation to allocate
that value to the individual items (Conklin, Powaga, & Lipovetsky, 2004). This provides a priority for improvement.
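A minimal R sketch of this evaluation, assuming a hypothetical 0/1 respondent-by-item dissatisfaction matrix D and a 0/1 indicator d_overall of overall dissatisfaction:

```r
# Youden's J for a candidate set of "must-be" items:
# J = P(dissatisfied with any item in the set | dissatisfied overall)
#   - P(dissatisfied with any item in the set | satisfied overall)
youden_j <- function(D, d_overall, items) {
  flagged   <- rowSums(D[, items, drop = FALSE]) > 0   # dissatisfied with any item in the set
  hit_rate  <- mean(flagged[d_overall == 1])           # reach into the overall-dissatisfied group
  false_pos <- mean(flagged[d_overall == 0])           # dissatisfied with an item, satisfied overall
  hit_rate - false_pos
}
```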
SUMMARY
Since we started using the Shapley Value in marketing research problems a decade and one
half ago we have found it to be a very useful technique whenever we need to allocate a total
value to component parts.
In the case of line optimization it immediately generalizes to a reasonable model of consumer
behavior making it an extremely useful business management tool. Other applications have also
proved to be quite useful. Business management, after all, seems to be primarily about
prioritization and the Shapley Value procedure provides a convenient way to prioritize the
components of many business decisions when direct measures of value of those components are
not available.
Michael Conklin
REFERENCES
Budescu, D. (1993). Dominance Analysis: a new approach to the problem of relative importance
in multiple regression. Psychological Bulletin, 114:542–551.
Conklin, M., Powaga, K., & Lipovetsky, S. (2004). Customer Satisfaction Analysis:
identification of key drivers. European Journal of Operational Research, 154: 819–827.
Gibson, W. A. (1962). On the least-squares orthogonalization of an oblique transformation.
Psychometrika, 11:32–34.
Gromping, U., & Landau, S. (2009). Do not adjust coefficients in Shapley value regression.
Applied Stochastic Models in Business and Industry.
Johnson, R. M. (1966). The Minimal Transformation to Orthonormality. Psychometrika, 61–66.
Kano, N., Seraku, N., Takahashi, F., & Tsuji, S. (1984). Attractive quality and must-be quality.
Journal of the Japanese Society for Quality Control, 39–48.
Kruskal, W. (1987). Relative Importance by Averaging over Orderings. The American
Statistician, 41:6–10.
Lindeman, R. H., Merenda, P. F., & Gold, R. Z. (1980). Introduction to Bivariate and
Multivariate Analysis. Glenview, Il: Scott, Foresman.
Lipovetsky, S., & Conklin, M. (2010). Reply to the paper “Do not adjust coefficients in Shapley
value regression.” Applied Stochastic Models in Business and Industry, 26: 203–204.
Shapley, L. S. (1953). A Value for n-Person Games. In H. W. Kuhn & A. W. Tucker (Eds.),
Contributions to the Theory of Games, Vol. II (pp. 307–317). Princeton, NJ: Princeton
University Press.
Youden, W. (1950). Index for rating diagnostic tests. Cancer, 3: 32–35.
DEMONSTRATING THE NEED AND VALUE FOR A MULTI-OBJECTIVE
PRODUCT SEARCH
SCOTT FERGUSON
GARRETT FOSTER
NORTH CAROLINA STATE UNIVERSITY
ABSTRACT
The product search algorithms currently available in Sawtooth Software’s Advanced
Simulation Module focus on optimizing product line configurations for a single objective. This
paper demonstrates how multi-objective product search formulations can significantly influence
and form a design strategy. Limitations of using a weighted sum approach for multi-objective
optimization are highlighted and the foundational theory behind a popular multi-objective
genetic algorithm is described. Advantages of using a multi-objective optimization algorithm are
shown to be richer solution sets and the ability to comprehensively explore tradeoffs between
numerous criteria. Opportunities for enforcing commonality are identified, and the advantage of
retaining dominated designs to accommodate un-modeled problem aspects is demonstrated. It is
also shown how linking visualization and optimization tools can permit the targeting of specific
regions of interest in the solution space and how a more complete understanding of the necessary
tradeoffs can be achieved.
1. INTRODUCTION
Suppose a manufacturer is interested in launching a new product. Richer product design
problems driven by estimates of heterogeneous customer preference are possible because of
advancements in marketing research and increased computational capabilities. However, when
heterogeneous preference estimates are considered, a single ideal product for an entire market is
not possible. Rather, a manufacturer must offer a product line to meet the diversity of the market.
Initial steps toward launching a product line might involve a manufacturer identifying its
manufacturing capabilities, contacting possible suppliers, determining likely cost structures, and
benchmarking the market competition. To understand how potential customers might respond to
different product offerings, a choice-based conjoint study can then be fielded (using SSI Web [1],
for example) to survey thousands of respondents. Part-worths for the different product attribute
levels are then estimated (using Sawtooth Software’s CBC/HB module [2], for example) and the
task of determining the configuration of each product begins.
Though armed with a wealth of knowledge about the market, the manufacturer may still be
unsure of exactly which attribute combinations will create the best product line. Rather than
assessing random product configurations, the manufacturer decides to use optimization to search
for the best configuration. The standard form of an optimization problem statement is shown in
Equation 1, where F(x) represents the objective function to be minimized, x is the vector of
design variables that defines the configuration of each product, g_j(x) represents possible
inequality constraints, h_k(x) represents the equality constraints, and the final expression describes
lower and upper bounds placed on each of the n design variables x_i.

$$\begin{aligned}
\min_{x}\;\; & F(x) \\
\text{subject to}\;\; & g_j(x) \le 0, \quad j = 1, \ldots, m \\
& h_k(x) = 0, \quad k = 1, \ldots, p \\
& x_i^{L} \le x_i \le x_i^{U}, \quad i = 1, \ldots, n
\end{aligned} \tag{1}$$
1.1 Setting up a single objective product search
Sawtooth Software’s Advanced Simulation Module (ASM) [3] offers product search
capabilities as part of SMRT. Information needed to conduct the product search includes:
• attribute levels to be considered for each product (the design variables)
• estimates of respondent part-worths
• attribute cost information
• the number of products to search for
• competing products/the “none” option
• size of the market
To illustrate the limited information gained from a single objective optimization, consider the
hypothetical design scenario of an MP3 player product line. As previously stated, one of the first
steps associated with product line design is identifying the product attributes (and levels)
considered. For this example, product attributes are shown in Table 1, and the cost of each
attribute level is shown in Table 2. To solve this configuration problem, four products are to be
designed, respondent part-worths are estimated using Sawtooth Software’s CBC/HB module, and
the “None” option is the only competition considered. Overall product price is calculated by
multiplying attribute cost by 1.5 and adding a constant base price of $52. The market size for this
simulation is 10,000 people.
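For example, under this pricing rule a configuration whose attribute costs total $76 would be priced at 1.5 × $76 + $52 = $166.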
Table 1. MP3 Player product attributes considered
DV   Attribute             Levels
X1   Photo/Video/Camera    1) None; 2) Photo only; 3) Video only; 4) Photo and Video only; 5) Photo and Lo-res camera; 6) Photo and Hi-res camera; 7) Photo, Video and Lo-res camera; 8) Photo, Video and Hi-res camera
X2   Web/App/Ped           1) None; 2) Web only; 3) App only; 4) Ped only; 5) Web and App only; 6) App and Ped only; 7) Web and Ped only; 8) Web, App, and Ped
X3   Input                 1) Dial; 2) Touchpad; 3) Touchscreen; 4) Buttons
X4   Screen Size           1) 1.5 in diag; 2) 2.5 in diag; 3) 3.5 in diag; 4) 4.5 in diag; 5) 5.5 in diag; 6) 6.5 in diag
X5   Storage               1) 2 GB; 2) 16 GB; 3) 32 GB; 4) 64 GB; 5) 160 GB; 6) 240 GB; 7) 500 GB; 8) 750 GB
X6   Background Color      1) Black; 2) White; 3) Silver; 4) Red; 5) Orange; 6) Green; 7) Blue; 8) Custom
X7   Background Overlay    1) No pattern/graphic overlay; 2) Custom pattern overlay; 3) Custom graphic overlay; 4) Custom pattern and graphic overlay
XP   Price                 1) $49; 2) $99; 3) $199; 4) $299; 5) $399; 6) $499; 7) $599; 8) $699
Table 2. MP3 Player product attribute cost
Level
Photo/Video/Camera
1
2
Screen
Size
Storage
Background
Color
Background
Overlay
$0.00
$0.00
$0.00
$0.00
$12.50
$22.50
$30.00
$35.00
$22.50
$60.00
$100.00
$125.00
$5.00
$5.00
$5.00
$5.00
$2.50
$5.00
$7.50
$150.00
$5.00
$175.00
$200.00
$5.00
$10.00
Web/App/Ped
Input
$0.00
$0.00
$0.00
3
4
5
$2.50
$5.00
$7.50
$8.50
$10.00
$10.00
$5.00
$20.00
$2.50
$20.00
$10.00
6
$15.00
$15.00
$40.00
7
8
$16.00
$21.00
$15.00
$25.00
The final step in setting up the optimization problem is defining the objective function. When
using ASM, four options are available: 1) product share of preference, 2) revenue, 3) profit, and
4) cost. Choosing from among these four options, however, is not an easy task. In addition to
preference heterogeneity, manufacturers must also respond to the challenges of conflicting
business goals. For example, market share of preference provides no insight into profitability. A
share increase could simply be achieved by lowering product prices. Conversely, it may be
possible to increase profits by increasing product prices. While more money would be made per
sale, this price increase will likely have a negative impact on share of preference.
1.2 Illustrating the limitation of a single objective search
Suppose a manufacturer initially chooses profit as the objective to maximize. Sorted by price,
the product configurations returned as the optimal solution are shown in Table 3. For this product
line offering, share of preference is 82.16% and profit is $1.355 million.
Not wanting to make a decision without sampling different regions of the solution space, the
manufacturer optimizes share of preference. The results from this optimization are shown in
Table 4, where share of preference is 96.33% and profit is $1.070 million.
Table 3. Optimal product line when maximizing profit
Share of preference: 82.16%    Profit: $1.355 million

Product 1: Photo, video and hi-res camera; Web and App; Dial; 1.5 in diag; 32 GB; Silver; Custom pattern and graphic overlay; $222.25
Product 2: Photo, video and hi-res camera; Web and App; Touchscreen; 4.5 in diag; 160 GB; Silver; Custom graphic overlay; $391
Product 3: Photo, video and hi-res camera; Web, App, and Ped; Touchscreen; 4.5 in diag; 500 GB; Black; Custom pattern and graphic overlay; $469.75
Product 4: Photo, video and hi-res camera; Web, App, and Ped; Touchscreen; 6.5 in diag; 750 GB; Green; Custom pattern and graphic overlay; $529.75
Table 4. Optimal product line when maximizing share of preference
Share of preference: 96.33%    Profit: $1.070 million

Product 1: Photo, video and hi-res camera; Web and App; Dial; 1.5 in diag; 16 GB; Silver; Custom pattern and graphic overlay; $166
Product 2: Photo, video and hi-res camera; Web and App; Dial; 4.5 in diag; 16 GB; Silver; Custom graphic overlay; $207.25
Product 3: Photo, video and hi-res camera; Web and App; Touchscreen; 4.5 in diag; 16 GB; Black; Custom pattern and graphic overlay; $241
Product 4: Photo, video and hi-res camera; Web, App, and Ped; Touchscreen; 6.5 in diag; 32 GB; Custom; No pattern or graphic overlay; $316
Configuration differences between the two product line solutions are readily apparent. While both
solutions offer one product that is under $300, the remaining products in Table 3 are more
expensive than any of those in Table 4. The most significant difference comes in the Storage
attribute, where the products in Table 3 have increased storage sizes that significantly drive up
product price. The increased per-attribute profit coming from storage in this solution offsets the
decrease in overall share. To be more competitive with the None option and capture as much
share as possible, the configurations in Table 4 are less expensive. Additionally, results in both
tables suggest significant opportunities for enforcing commonality, a notion that will be explored
further in Section 4.
An ideal solution would simultaneously maximize both market share of preference and
profit. However, the results in Tables 3 and 4 verify that profit and share of preference are
conflicting objectives—that is, to increase one a sacrifice must occur in the other. Yet, as shown
in Figure 1, the ability to describe the nature of this tradeoff is extremely limited. Without
additional information it is impossible to make any statements about the region between these
points. This tradeoff leads to a variety of questions that must be answered:
• Is the tradeoff between the objectives linear?
• What is the right balance that should be achieved between these objectives?
• Are these the only objectives that should be considered?
Figure 1. Comparing the results of the single objective optimizations
1.3 Posing a multi-objective optimization problem
In problems with multiple competing objectives the optimum is no longer a single solution.
Rather, an entire set of non-dominated solutions can be found that is commonly known as the
Pareto set [4]. A solution is said to be non-dominated if there are no other solutions that perform
better on at least one objective and perform at least as well on the other objectives. A solution
vector x* is said to be Pareto optimal if and only if there does not exist another vector x for which
Equations 2 and 3 hold true.

$$F_i(x) \le F_i(x^*) \quad \text{for } i = 1, \ldots, t \tag{2}$$

$$F_i(x) < F_i(x^*) \quad \text{for at least one } i,\; 1 \le i \le t \tag{3}$$
Building upon the formulation introduced in Equation 1, a multi-objective problem
formulation is shown in Equation 4. In this equation, t represents the total number of objectives
considered.
$$\min_{x}\; \bigl[F_1(x),\, F_2(x),\, \ldots,\, F_t(x)\bigr] \quad \text{subject to}\;\; g_j(x) \le 0,\; h_k(x) = 0,\; x_i^{L} \le x_i \le x_i^{U} \tag{4}$$
While Sawtooth Software’s ASM does not currently support the simultaneous optimization of
multiple objectives, the engineering community has frequently used such problem formulations
to explore the tradeoffs between competing objectives [5–10]. Multidimensional visualization
tools have also been created that facilitate the exploration of a large number of solutions and the
ability to focus on interesting regions of the solution space [11–13]. This provides the
opportunity for additional insights into the tradeoffs between objectives that can be especially
helpful in the early stages of design when a product strategy is still being formed.
The goal of this paper is to provide an introduction into how the set of non-dominated
solutions can be found in a multi-objective problem and demonstrate the benefits of having this
additional information.
2. FOUNDATIONAL BACKGROUND
Finding the solution to a problem with multiple competing objectives requires a set of non-dominated points to be identified. This is done by sampling possible solutions in the design
space. In product line design problems, the design space is where product configurations are
established.
Definition:
Design space—Referring back to Equation 4, the design space is an n-dimensional space that contains all possible combinations of the design
variables.
Design variables can either be continuous or discrete. For the MP3 player example defined in
the previous section, it is assumed that only discrete attribute levels are considered. This creates
an n-dimensional grid that must be sampled to find the most effective design. A two-dimensional
view of this grid is shown in Figure 2.
Each point in the design space is a unique design. Evaluating the performance of a design
point across all t objectives defines a specific one-to-one mapping to the performance space.
Definition:
Performance space—Quantifies the value of a combination of design variables
(a design) with respect to each system objective.
As shown on the right in Figure 2, the set of non-dominated solutions in the performance
space leads to the identification of the Pareto set (often called the Pareto frontier in two
dimensions).
Figure 2. Representing the design and performance space in multi-objective optimization
2.1 Locating the Pareto set using a linear combination of objectives
A simple, popular, and well-known approach to finding the design configurations associated
with the Pareto set is to convert the multi-objective optimization problem into a single objective
convex optimization of the different objectives [14–16]. This is done using a weighted-sum
approach, where the problem given in Equation 5 is solved:
$$\min_{x}\; \sum_{i=1}^{t} w_i\, F_i(x) \tag{5}$$

In this formulation, the t weights are often chosen such that $w_i \ge 0$ and $\sum_{i=1}^{t} w_i = 1$. Solving
Equation 5 for a specific set of weights creates a single Pareto optimal point. In some
formulations, it is required that the weighting value be a positive number. This is because a
weight value of zero can lead to a weakly dominated design.
To generate several points in the Pareto set, an even spread of the weights can be sampled.
However, research has indicated several limitations to this strategy [17–21], including that:
• non-dominated solutions in non-convex regions of the Pareto set cannot be generated, regardless of weight granularity
• an even spread of weights does not produce an even spread of Pareto points in the performance space
• it can be difficult to know the relative importance of an objective a priori
• objectives must be normalized before weights are defined such that the weighting parameter is not merely compensating for differences in objective function magnitude
• the weighted objective function is only a linear approximation in the performance space
• it is not computationally efficient to solve Equation 5 multiple times (once for each set of weights considered)
Difficulties of generating non-convex points
The inability to generate non-convex solutions of the Pareto set can be graphically illustrated
when two objectives are considered [18]. Here, Equation 5 simplifies to:
$$\min_{x}\; w\,F_1(x) + (1 - w)\,F_2(x) \tag{6}$$

where w is constrained between 0 and 1 in Equation 6. An equivalent formulation is a
trigonometric linear combination, as shown in Equation 7, where the scalar θ varies between 0
and π/2. Since the two formulations are equivalent, any non-dominated solution that can be found
using Equation 6 can also be found using Equation 7. Also, if a non-dominated solution cannot be found using
Equation 7, then it cannot be obtained using any convex combination of the two objectives.

$$\min_{x}\; F_1(x)\cos\theta + F_2(x)\sin\theta \tag{7}$$
If the axes (F_1, F_2) that define the performance space are rotated counterclockwise by an
angle θ, the rotated axes are given by Equation 8. From Figure 3, minimizing the rotated objective F̄_1 is
equivalent to translating the F̄_2 axis parallel to itself until it intersects the non-dominated solution
set. This intersection point is the Pareto point P. Solving the problem given by Equation 7 for all
values of θ explores different axis rotations by varying the slope of the tangent from 0 to −∞
while maintaining contact with the non-dominated frontier.

$$\bar{F}_1 = F_1\cos\theta + F_2\sin\theta, \qquad \bar{F}_2 = -F_1\sin\theta + F_2\cos\theta \tag{8}$$
Figure 3. Obtaining a Pareto point by solving the trigonometric linear combinations
problem (adapted from [18])
If all of the objective functions and constraints for a multi-objective optimization problem are
convex—the Hessian matrix is positive semi-definite or positive definite at all points on the
set—then the weighted sum approach can locate all Pareto optimal points. However, consider the
scenario presented by Figure 4. For the line segment shown, the slope of the tangent touches the Pareto
frontier at two distinct locations. As the slope of the tangent is rotated to become less negative, Pareto
frontier points in one of these regions are located; rotations making the slope more negative
identify Pareto points in the other region. Since there are no rotations capable of identifying
solutions in the portion of the Pareto frontier between these two regions, the non-convex region would not be found.
Figure 4. Illustrating the inability to locate non-convex portions of the Pareto frontier using
trigonometric linear combinations (adapted from [18])
Difficulties of generating an even spread of Pareto points
Even if the objective functions and constraints are convex, an even spread of weight values
does not guarantee an even spread of Pareto points in the performance space. Rather, solutions
will often clump in the performance space and provide very little information about the possible
tradeoffs elsewhere. To demonstrate this challenge, consider the multi-objective optimization
problem given in Equation 9.
(9)
A set of 11 Pareto points was found for this problem by varying w from 0 to 1 in even
increments of 0.1. Figure 5 shows these points plotted in the performance space. In this figure,
only 9 Pareto points are visible, as values of 0.8, 0.9 and 1 for w yield the same solution of (F1,
F2) = (10, -4.0115). There is significant clustering for solutions obtained at small values of w
(where the weighted function primarily focuses on objective F1). Further, there is a noticeable
gap in the frontier that occurs between weighting values of 0.6 and 0.7.
Figure 5. Illustrating uneven distribution of Pareto points despite an even distribution of
the weight parameter
Addressing the remaining challenges
Beyond locating points on the Pareto frontier, defining the weights themselves can pose a
significant challenge. One issue that must be addressed before implementing a weighted sum
approach is ensuring a comparable scale for the objective function values. If these values are not
of the same magnitude, some of the weights may have an insignificant impact on the weighted
objective function that is being minimized. Thus, all objective functions should be modified in
such a way that they have similar ranges.
However, as noted by Marler and Arora [21], when exploring solutions to a multi-objective
problem and using weights to establish tradeoffs, the objective functions should not be
normalized. This requires an extra computational step, as the objectives must be first transformed
for the optimization and then transformed back when presenting solutions. Further, the
formulation presented in Equation 5 assumes a linear “preference” relationship between
objectives. This has led researchers to explore more advanced non-linear relationships between
objective functions [22].
Finally, solving for the Pareto point associated with a given weight combination requires an
optimization to be conducted. If 100 points are desired for the Pareto set, 100 optimizations must
be run. For computationally expensive problems, this can prove to be burdensome and
challenging. In response to these issues associated with the weighted sum approach, researchers
have explored modifications to existing heuristic optimization approaches that allow for more
efficient and effective identification of the Pareto set. The next section discusses one extension of
genetic algorithms for multi-objective optimization problems.
2.2 Multi-objective genetic algorithms (MOGAs)
A number of multi-objective evolutionary algorithms have been proposed in the literature
[23–30], mainly because they are capable of addressing the many limitations associated with the
weighted sum approach. Further, the engineering community has frequently used the results from
multi-objective problem formulations to explore the tradeoffs between competing objectives. In
this “design by shopping” paradigm [31], multidimensional visualization tools are used to
explore a large number of alternative solutions and allow interesting regions of the space to be
selected for further exploration. In support of this goal, multi-objective genetic algorithms
provide the ability to find multiple Pareto optimal points in a single run.
At its most basic level, the foundation of a genetic algorithm can be described by five basic
components [32]. These components are:
• a genetic representation of possible problem solutions
  Several methods of encoding solutions exist, such as binary encoding, real number encoding, integer encoding, and data structure encoding.
• a means of generating an initial population
  A random initial population is typically created that covers as much of the design space as possible to ensure thorough exploration.
• techniques for evaluating design fitness
  Design fitness describes the goodness of a possible solution. It is used to quantify the difference in solution performance or provide a ranking of the designs.
• genetic operators capable of producing new designs using previous design information
  Selection, crossover and mutation are the three primary genetic operators. Selection is used to determine which designs will produce offspring. Crossover is used to represent a mating process and mutation introduces random variation into the designs.
• parameter settings for the genetic algorithm
  These parameters control the overall behavior of the genetic search. Examples include population size, convergence criteria, crossover rate, and mutation rate.
In accommodating problems with multiple performance objectives, it is necessary to modify
the form of the fitness function used to assess a design. There are many reasons for this. First, in
the absence of additional information, it is impossible to say that one Pareto point is better than
another. As shown in the previous section, linear combinations of the objective functions suffer
from multiple limitations. Second, in the presence of constraints a common optimization
procedure is to apply a penalty function to signify infeasibility. However, when multiple
objectives are considered, it is not clear to which objective the penalty should be applied.
While multiple variations of multi-objective genetic algorithms exist, this paper focuses on
one of the more popular variants: NSGA-II [26]. Readers interested in a more thorough treatment
of advancements in multi-objective genetic algorithms are directed to [33, 34]. NSGA-II is
primarily characterized by its fast non-dominated sorting procedure, diversity preservation, and
elitism, each discussed in detail below.
Similar to a single objective genetic algorithm, the first step is to create an initial population
of designs. The members of this initial population can either be created randomly or using
targeted procedures [35–37] to improve computational efficiency and improve solution quality.
Fitness of a design is defined by its non-domination level, as shown in Figure 6. Assuming
minimization of the performance objectives, smaller values of the non-domination level correspond to
better solutions, with 1 being the best.
Figure 6. Representation of ranked points in the performance space
Non-dominated sorting procedure
Defining the non-domination level for a design p begins by determining the number of
solutions that dominate it and the set of solutions (Sp) that p dominates. By the principle of
Pareto optimality, designs with a domination level of 1 start with their domination count at 0.
That is, no designs dominate them. Now, for each design in the current domination level the
domination count of each solution in Sp is reduced by one. If any of these points in Sp are now
non-dominated, they are placed into a separate list. This separate list represents the next non-domination level. This process continues until all fronts (non-domination levels) are identified.
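A compact (not speed-optimized) R sketch of this ranking, assuming a hypothetical matrix obj of objective values with one row per design and all objectives to be minimized:

```r
# Assign non-domination levels (1 = best) to a set of designs.
dominates <- function(a, b) all(a <= b) && any(a < b)

nd_rank <- function(obj) {
  n <- nrow(obj); lvl <- rep(NA_integer_, n); level <- 0L
  remaining <- seq_len(n)
  while (length(remaining) > 0) {
    level <- level + 1L
    # designs not dominated by any other remaining design form the current front
    front <- remaining[sapply(remaining, function(i)
      !any(sapply(remaining, function(j) j != i && dominates(obj[j, ], obj[i, ]))))]
    lvl[front] <- level
    remaining <- setdiff(remaining, front)
  }
  lvl
}
```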
Diversity preservation
Diversity preservation is significant for two reasons. First, it is desired that the final solution
is uniformly spread across the performance space. Second, it provides a tie-breaker during
selection when two designs have the same non-domination rank. In NSGA-II, the crowding
distance around each design is calculated by determining the average distance of two points on
either side of this design along each of the objectives. To ensure that this information can be
aggregated, each objective function is normalized before calculating the crowding distance. A
solution with a larger value of this measure is “more unique” than other solutions, suggesting
that the space around this design should be explored further.
Crowding distance is then used as a secondary comparison between two designs during
selection. Assume that two designs have been chosen. The first comparison between the designs
is the non-domination rank. If one design has a lower non-domination rank than the other design,
the design with the lower non-domination rank is chosen. When the non-domination rank is the
same for both points the crowding distance measure is considered, and the design with the larger
value for crowding distance is chosen. This allows designs that exist in less crowded regions of
the performance space to be chosen and helps ensure a more even spread of Pareto points in the
final solution.
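The two ingredients above can be sketched in R as follows (hypothetical input: a matrix obj of objective values for the designs within a single non-domination front):

```r
# Crowding distance: for each objective, sort the front, give boundary designs
# an infinite distance, and add the normalized gap between each design's two
# neighbours; larger values indicate less crowded (more unique) designs.
crowding_distance <- function(obj) {
  n <- nrow(obj); d <- rep(0, n)
  for (m in seq_len(ncol(obj))) {
    o   <- order(obj[, m])
    rng <- max(obj[, m]) - min(obj[, m])
    d[o[1]] <- d[o[n]] <- Inf
    if (n > 2 && rng > 0)
      d[o[2:(n - 1)]] <- d[o[2:(n - 1)]] +
        (obj[o[3:n], m] - obj[o[1:(n - 2)], m]) / rng
  }
  d
}

# Crowded-comparison operator: lower non-domination rank wins;
# ties are broken in favour of the larger crowding distance.
crowded_better <- function(rank_a, dist_a, rank_b, dist_b) {
  rank_a < rank_b || (rank_a == rank_b && dist_a > dist_b)
}
```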
Elitism
When a group of offspring designs is created, they are combined with the original parent
population. This creates a population with a size of 2N, where N is the size of the original
population. To reduce the population size back to N all designs are sorted with respect to their
non-domination ranking and then with respect to their crowding distance. The top N designs are
then chosen to be the parent population for the next iteration.
A flowchart describing the NSGA-II algorithm is shown in Figure 7. The next section revisits
the example problem originally proposed in Section 1 to explore the solution obtained when
using a multi-objective genetic algorithm. Advantages of this approach are also discussed.
Figure 7. Describing the NSGA-II algorithm
3. SETTING UP THE MULTI-OBJECTIVE PRODUCT LINE SEARCH
This section expands upon the example presented in Section 1 by setting up the problem as a
multi-objective product line optimization. As before, two objective functions will be considered.
Four products are to be designed in the line, and the full multi-objective problem formulation is
given by Equation 10.
The first objective is maximizing the market share of preference (SOP) captured by the
product line. The outer summation combines the share of preference captured by each product.
The numerator’s outermost summation combines the probability of purchase for all respondents
for the current product before dividing by the number of respondents. Respondent j’s
probability of purchase is calculated by dividing the exponential of a product’s observed utility
by the sum of the exponentials of the other products, including the part-worth associated
with the “None” option.
The second objective is profit. Profit of a product line can be approximated using
contribution margin per person in the market (i.e., per capita), or per capita contribution margin
(PCCM). To combine the margin of the four products in the line, a weighting scheme must be
constructed using the share of preference of each individual product. This ensures that a
product with high margin and low sales does not artificially inflate the metric. PCCM can also be
used to estimate the aggregate contribution margin of a product line by multiplying PCCM by
the market size.
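In symbols (hypothetical notation consistent with the description above: J respondents, products p in the line L, observed utilities U_jp, and per-unit contribution margin m_p):

$$SOP \;=\; \sum_{p \in L} SOP_p, \qquad SOP_p \;=\; \frac{1}{J}\sum_{j=1}^{J} P_{jp}, \qquad P_{jp} \;=\; \frac{e^{U_{jp}}}{\sum_{q \in L} e^{U_{jq}} + e^{U_{j,\mathrm{none}}}},$$

$$PCCM \;=\; \sum_{p \in L} SOP_p\, m_p .$$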
Maximize:          share of preference (SOP) and per capita contribution margin (PCCM)        (10)
by changing:       feature content (X_jk)
with respect to:   no identical products in the same product line;
                   lower and upper level bounds on each attribute
The optimization problem formulated in Equation 10 consists of 28 design variables—4
products with 7 attributes per product, as shown in Figure 8. To identify the non-dominated
points for this problem formulation, a multi-objective genetic algorithm was fielded using an
initial population initialized using Latin hypercube sampling. A listing of relevant MOGA
parameters is given in Table 5. The MOGA used in this paper was coded in Matlab [38], and was
an extension of the foundational theory presented in Section 2.2. Figure 9 depicts the location of
the frontier when the stopping criterion of 600 generations was achieved.
Figure 8. Illustration of design string
Table 5. Input parameters for the MOGA
Criteria                                   Setting
Initial population size                    280 (10 times the number of design variables)
Offspring created within a generation      280 (equal to original population size)
Selection                                  Tournament (4 candidates)
Crossover type                             Scattered
Crossover rate                             0.5
Mutation type                              Adaptive
Mutation rate                              5% per bit
Stop after                                 600 generations
Figure 9. Set of non-dominated solutions after 600 generations
The plot shown in Figure 9 illustrates the additional solutions that are found when the
product line design problem is solved using a multi-objective genetic algorithm. The next section
of the paper explores the advantages—in terms of information available, insights gained, and
user interaction—that are possible because of this approach.
4. USING MULTI-OBJECTIVE OPTIMIZATION RESULTS TO GUIDE MARKET-BASED DESIGN
STRATEGIES
Formulating and solving a multi-objective product line design problem is the first step in
defining a design strategy. This section illustrates how information from the solution can
influence the choice of design strategy and problem formulation. By walking through an example
of how such data might be analyzed, it is shown that product architecture insights can be
gathered by considering the entire set of non-dominated solutions, and that dominated solutions
can be explored to accommodate preferences associated with un-modeled objectives. If this
causes the scope of the multi-objective problem to be expanded, the space can be explored using
interactive multidimensional visualization tools. Visualizing the non-dominated set allows areas
of interest to be identified in “real-time” and solutions populated in those areas of the space.
4.1 Deriving product architecture insights from the non-dominated set of solutions
The increased information available in Figure 9 may make it difficult for a manufacturer to
select a solution. Consider that the manufacturer selects four non-dominated points that appear
interesting. As shown in Figure 10, the “Max. share” solution is chosen because it maximizes
share of preference. The “Max. profit” solution is chosen because it maximizes profit. The
“Profit trade” solution is chosen because it gains a significant increase in share (almost 8%)
while sacrificing very little in profit. Finally, the “Share trade” solution is chosen because it gains
profit with a very small decrease in market share (about 1%).
Figure 10. Selecting four of the non-dominated solutions
Product configurations for the four solutions are shown in Table 6. Also shown is the
calculated share of preference and profit for a market size of 10,000 people. From this
information, the manufacturer can then compare the properties of a solution. For instance, the
“Profit trade” solution is the only product line to offer “Buttons” as an input, and it only occurs
in one of the products. The share-focused solutions are configured with very small storage
options compared to that of the profit-focused solutions. This supports the insight from Tables 3
and 4 that suggested smaller storage sizes are used to capture share (because they are less
expensive products that can compete against the None), while larger storage options generate
more overall profit (at the expense of share of preference).
Table 6. Product line configurations for the four chosen solutions
96.33%
$1.070
million
Share
trade
solution
95.38%
$1.198
million
Photo/Video/Camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Web/App/Ped
Web and app only
Web and app only
Web and app only
Web and app only
Dial
Dial
Buttons
Dial
1.5 in diag.
1.5 in diag.
1.5 in diag.
1.5 in diag.
16 GB
16 GB
16 GB
32 GB
Max. share
solution
Product line
Share of preference
Profit
Input
Product 1
Screen Size
Storage
Background Color
88.42%
$1.348
million
82.16%
$1.355
million
Silver
Silver
Silver
Silver
Custom pattern
and graphic
Custom graphic
Custom pattern
and graphic
$166
$166
$177.25
$222.25
Photo/Video/Camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Web/App/Ped
Web and app only
Web and app only
Web and app only
Web and app only
Dial
Touchscreen
Touchscreen
Touchscreen
4.5 in diag.
4.5 in diag.
4.5 in diag.
4.5 in diag.
Storage
16 GB
16 GB
160 GB
160 GB
Background Color
Silver
Silver
Silver
Silver
Custom graphic
Custom graphic
Custom graphic
Custom graphic
$207.25
$237.25
$391
$391
Photo/Video/Camera
Photo, video, and
hi-res camera
Photo, video, and
hi-res camera
Web/App/Ped
Web and app only
Web and app only
Touchscreen
Touchscreen
Photo, video, and
hi-res camera
Web, app, and
ped
Touchscreen
Photo, video, and
hi-res camera
Web, app, and
ped
Touchscreen
4.5 in diag.
4.5 in diag.
4.5 in diag.
4.5 in diag.
16 GB
16 GB
500 GB
500 GB
Black
Price
Input
Screen Size
Background Overlay
Price
Input
Product 3
Max. profit
solution
Custom pattern
and graphic
Background Overlay
Product 2
Profit trade
solution
Screen Size
Storage
Background Color
Background Overlay
Price
Silver
Custom
Black
Custom pattern
and graphic
Custom graphic
Custom pattern
and graphic
Custom pattern
and graphic
$235.5
$237.25
$484.75
$469.75
Photo/Video/Camera
Web/App/Ped
Input
Product 4
Photo, video, and
hi-res camera
Web, app, and
ped
Touchscreen
Photo and video
only
Web, app, and
ped
Touchscreen
Photo, video, and
hi-res camera
Web, app, and
ped
Touchscreen
Photo, video, and
hi-res camera
Web, app, and
ped
Touchscreen
6.5 in diag.
4.5 in diag.
6.5 in diag.
6.5 in diag.
32 GB
64 GB
750 GB
750 GB
Custom
Black
Green
Green
No pattern or
graphic
Custom pattern
and graphic
Custom pattern
and graphic
Custom pattern
and graphic
$316
$337
$529.75
$529.75
Screen Size
Storage
Background Color
Background Overlay
Price
Similar to the results in Tables 3 and 4, the product configurations in Table 6 show a
significant degree of commonality. To get a better understanding of the solution space, the
manufacturer explores (i) how many unique product configurations exist, and (ii) how many
different attribute levels are used. Figure 4 has 71 solutions, and each solution has 4 products per
line, meaning that there are 276 total product configurations in the solution set. Of these 276
products, only 47 unique configurations exist. The breakdown of attribute usage in these 47
products is shown in Table 7.
Table 7. Breakdown of attribute usage in the 47 unique product configurations
(Each row gives, for levels 1 through 8 of the attribute, the number of the 47 unique products that use that level.)

Photo/Video/Camera: 0, 1, 0, 4, 0, 0, 1, 41
Web/App/Ped: 0, 0, 1, 0, 22, 1, 0, 23
Input: 7, 2, 36, 2
Screen Size: 8, 0, 1, 24, 1, 13
Storage: 0, 12, 4, 6, 12, 0, 4, 9
Background Color: 14, 0, 16, 1, 1, 7, 0, 8
Background Overlay: 1, 2, 15, 29
The darkest cells represent attribute levels used in over 20% of the products. Examples
include the “Photo, Video, and Hi-Res Camera” option (in 41 of the 47 products) and a
“Touchscreen” input (in 36 of the 47 products). The remaining shaded cells are used in at least
one product, such as the “Photo only” option (in 1 of the 47 products). There are 13 product
attribute levels (28.26% of the total number of attribute levels) that are never used.
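A tabulation like Table 7 can be reproduced directly from the optimizer's output. The short Python sketch below is only an illustration of that bookkeeping; the attribute list and toy product tuples are made up for the example and are not the actual solution set.

```python
from collections import Counter

# Hypothetical structure: each solution is a list of product tuples, and each
# product tuple holds one level index per attribute, in a fixed attribute order.
ATTRIBUTES = ["Photo/Video/Camera", "Web/App/Ped", "Input", "Screen Size",
              "Storage", "Background Color", "Background Overlay"]

def summarize_solution_set(solutions):
    """Count unique product configurations and tally attribute-level usage."""
    all_products = [product for line in solutions for product in line]
    unique_products = set(all_products)

    # Tally level usage over the unique configurations only (as in Table 7).
    usage = {attr: Counter() for attr in ATTRIBUTES}
    for product in unique_products:
        for attr, level in zip(ATTRIBUTES, product):
            usage[attr][level] += 1
    return len(all_products), len(unique_products), usage

# Toy example: two product lines of two products each.
toy_solutions = [
    [(8, 2, 3, 1, 1, 1, 4), (8, 2, 3, 4, 2, 1, 2)],
    [(8, 2, 3, 1, 1, 1, 4), (8, 5, 1, 4, 4, 2, 4)],
]
total, unique, usage = summarize_solution_set(toy_solutions)
print(total, unique)    # 4 total configurations, 3 unique
print(usage["Input"])   # level counts for the Input attribute
```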
For product attributes where one level is primarily used, the manufacturer may want to
consider product platforming possibilities [39]. By making this attribute a key element of the
product line’s architecture, the manufacturer may realize cost savings that permit variety to be
offered in the other attributes. For example, it appears that there are few business advantages of
offering an MP3 player without the ability to play photos and videos and capture content using a
hi-resolution camera. Yet, the solution set—and thereby the market—is not nearly as
homogeneous when it comes to storage size. By platforming around the first attribute, solutions
from the non-dominated set can be explored that have multiple storage sizes. By doing this, a
manufacturer can strategically offer product variety in a way that captures different groups in the
respondent market.
4.2 Influencing design strategy using dominated designs
Suppose that after considering the trade ramifications between business goals the
manufacturer has identified a favorite solution from Figure 4. As shown in Figure 11, this solution
captures 93.452% share of preference and returns a profit of $1.261 million. Since Figure 11 is a
zoomed-in view of Figure 10, the "Share trade" solution is also identified to give perspective.
Figure 11. Improving commonality by considering dominated solutions
Now, assume that the manufacturer (after interpreting the results in Tables 3, 4 and 7) is also
interested in the commonality associated with the product line solution. Commonality often has a
benefit due to tooling cost and inventory savings. The commonality index (CI) was introduced
by Martin and Ishii [40,41] as a measure of unique parts, and can be calculated using Equation 11:

CI = 1 - (u - max_i m_i) / (Σ_{i=1}^{n} m_i - max_i m_i)   (11)

Here, u is the total number of distinct components, m_i is the number of components used in
variant i, and n is the number of variants in the product line. CI ranges from 0 to 1, where a
smaller value indicates more unique parts. The CI for the current solution is 0.4762.
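A minimal sketch of this calculation, assuming the Martin-Ishii form given in Equation 11. The example inputs (four variants with seven non-price attributes each and 18 distinct attribute-level "parts" overall) are our own illustration of values consistent with the CI of 0.4762 reported above, not figures taken from the paper.

```python
def commonality_index(parts_per_variant, unique_part_count):
    """Martin-Ishii commonality index: CI = 1 - (u - max m_i) / (sum m_i - max m_i)."""
    m = list(parts_per_variant)
    largest = max(m)
    return 1.0 - (unique_part_count - largest) / (sum(m) - largest)

# Illustrative example: four products, seven attributes each, 18 distinct levels overall.
print(round(commonality_index([7, 7, 7, 7], unique_part_count=18), 4))  # 0.4762
```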
The manufacturer likes this solution with respect to the business goals, but would also be
interested in a solution with increased commonality to potentially realize greater cost savings.
Recall that the non-dominated set is comprised of Rank 1 solutions (triangles). Moving to the
neighboring Rank 1 solution on the right (increasing share of preference) or on the left
(decreasing share of preference) yields a solution with the same CI value of 0.4762, that is, no
change in commonality.
Not wanting to deviate too far from the current solution, the manufacturer uses the MOGA
data to recall the Rank 2 solutions. These solutions are only dominated by Rank 1 solutions, as
discussed in Section III.C. Exploring these Rank 2 solutions, the manufacturer finds a product
line with a CI of 0.5714. This Rank 2 product line configuration, also highlighted in Figure 11,
has a share of preference of 93.2945% and a profit of $1.2585 million. By keeping a rank listing
of the final population from—or all points evaluated by—the MOGA, the manufacturer has a
degree of design freedom with which to explore the solution space.
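A minimal sketch of how such a rank listing can be recovered from a set of evaluated (share, profit) points, assuming both objectives are maximized; the example points reuse values reported in this section purely for illustration and are not the full MOGA population.

```python
def dominates(a, b):
    """a dominates b if it is at least as good in every objective and strictly better in one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_ranks(points):
    """Assign non-dominated ranks (1 = non-dominated, 2 = dominated only by Rank 1, ...)."""
    remaining = dict(enumerate(points))
    ranks, current_rank = {}, 1
    while remaining:
        front = [i for i, p in remaining.items()
                 if not any(dominates(q, p) for j, q in remaining.items() if j != i)]
        for i in front:
            ranks[i] = current_rank
            del remaining[i]
        current_rank += 1
    return ranks

# Illustrative points: (share of preference in %, profit in $ millions).
pts = [(93.452, 1.261), (93.2945, 1.2585), (95.38, 1.198), (82.16, 1.355)]
print(pareto_ranks(pts))  # e.g., {0: 1, 2: 1, 3: 1, 1: 2}: the 93.2945/1.2585 point is Rank 2
```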
4.3 Using n-dimensional problem formulations to drive product strategy
Considering dominated designs is one strategy for navigating the solution space while trying
to minimize the trades made between conflicting objectives. However, it is not guaranteed that
the solution in Figure 11 is optimal across all three objectives (maximize share of preference,
maximize profit, maximize commonality). Since multi-objective problem formulations allow for
a nearly “infinite” number of objectives to be considered simultaneously, the manufacturer may
want to restructure the problem to consider all three objectives. Equation 12 shows the extension
of the problem originally posed in Equation 10.
Maximize: share of preference, profit, and commonality index   (12)
by changing: feature content (X_jk)
with respect to: no identical products in the same product line;
lower and upper level bounds on each attribute
Including additional objectives in the problem formulation allows for a focus to be placed on
various business objectives. For example, a manufacturer may want to decide on a product
strategy by maximizing the penetration of a particular attribute level. A focus could also be
placed on maximizing the probability of purchase for a specific demographic. Other possible
examples include designing for age groups, genders, or geographical location. However,
increasing the size of the optimization problem does not come without a cost. Adding objectives
increases computational expense and the difficulty associated with navigating the solutions for
insights. Therefore, strategic choices of problem objectives must be made to balance
computational expense, the magnitude of information reported, and variety of business goals.
Efforts to develop effective tradespace exploration tools [11–13] facilitate
multidimensional visualization, filtering of unwanted solutions, and detailed exploration of
interesting regions of the solution space. Tradespace exploration tools, like Penn State’s ARL
Trade Space Visualizer (ATSV) [11, 12], can also be linked to optimization algorithms to enable
real-time user interaction and design steering. When the number of objectives or criteria
considered goes beyond four or five, visualization tools become significantly less effective.
Technical feasibility models have been developed that create parametric representations of the
non-dominated frontier. This work has demonstrated that the feasibility of a desired set of
business goals could be tested in the engineering domain and that the necessary product
configurations capable of meeting these goals could be quickly determined [10, 42].
Since this problem only has three objectives, the manufacturer can use software like ATSV to
visualize the non-dominated set of solutions. Figure 12 shows a three-dimensional glyph plot of
the non-dominated solution set. Color has been added to this figure to show the various levels of
commonality. ATSV would allow the manufacturer to explore this three dimensional plot by
rotating, zooming and panning. This would allow color to be used to display information about
another aspect of the problem. However, color was chosen here to show commonality due to the
perspective challenges associated with the printed page.
As expected, an increase in commonality sacrifices performance with respect to share of
preference and profit. However, possible cost savings from increased commonality are not
included in the problem formulation. This could be added with more advanced costing models.
With the information provided, the manufacturer explores the available solutions and chooses the
one that best reflects the acceptable trade between objectives, as shown by the highlighted
design.
Figure 12. Non-dominated solutions for the three objective problem formulation
Table 8 shows the configuration of the product line, sorted by price. Commonality in this
solution is higher than in the solutions previously selected by the manufacturer. The lightly shaded
cells reflect configuration changes that are included in more than one product. Darker cells
reflect configuration changes that are unique to a single product. In the first product, for
example, the configuration change (1.5 in diagonal screen) is done to reduce cost. For the more
expensive product, two unique changes exist that increase product price but provide variety not
offered by the other products in the line.
Table 8. Product line chosen from the three-objective problem formulation

Share of preference: 96.186%
Profit: $1.0705 million
Commonality index: 0.6667

Product configurations (sorted by price):
Product 1: Photo, video, and hi-res camera; Web and App; Dial; 1.5 in diag; 16 GB; Silver; Custom pattern and graphic overlay; $166
Product 2: Photo, video, and hi-res camera; Web and App; Dial; 4.5 in diag; 16 GB; Silver; Custom graphic overlay; $207.25
Product 3: Photo, video, and hi-res camera; Web, App, and Ped; Touchscreen; 4.5 in diag; 16 GB; Black; Custom pattern and graphic overlay; $241
Product 4: Photo, video, and hi-res camera; Web and App; Touchscreen; 6.5 in diag; 32 GB; Silver; Custom graphic overlay; $308.50
4.4 Targeting regions of interest in the solution space
While the non-dominated solutions in Figure 12 do provide significantly more information
about the tradeoff between share of preference, profit, and commonality, there are still regions of
the solution space where no tradeoff information is present. These "gaps" [42] in the
non-dominated set may exist because the algorithm has not yet found points in that region of the
performance space, or because no solutions are possible in that region.
Exploring the results in Figure 12, the manufacturer notices that there is one such gap
directly “below” the product line solution shown in Table 8. Curious as to whether product line
solutions exist in that region of the space, the manufacturer can modify the problem formulation
posed in Equation 12. As shown in Equation 13, adding side constraints to the values of the
objective function provides a region of attraction for the algorithm to target. The MOGA then
can be re-initialized to pursue solutions that exist within these domains of attraction.
Maximize: share of preference, profit, and commonality index   (13)
by changing: feature content (X_jk)
with respect to: no identical products in the same product line;
lower and upper level bounds on each attribute;
lower and upper bounds (the region of attraction) on the share of preference and profit objective values
When used in conjunction with multidimensional visualization tools, these attractors
[12] can be placed in "real-time" as the algorithm progresses toward the final set of
non-dominated points. This allows a manufacturer to explore regions of the space deemed
interesting and directly supports the original concept of the “design by shopping” paradigm
[31] where manufacturers play an active role in arriving at the final answer. To fill
the gap in this frontier, the manufacturer places upper and lower bounds on share of preference
(95.85% to 96.11%) and profit ($1.099 million to $1.137 million). The MOGA is re-run with
these constraints in place to find additional solutions.
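One simple way to express this region of attraction is as a feasibility filter or penalty applied to each candidate product line's objective values. The sketch below illustrates that idea with the stated bounds; it is not the constraint-handling scheme actually used inside the MOGA.

```python
SHARE_BOUNDS = (95.85, 96.11)    # percent, the gap targeted below the Table 8 solution
PROFIT_BOUNDS = (1.099, 1.137)   # $ millions

def in_attractor(share, profit,
                 share_bounds=SHARE_BOUNDS, profit_bounds=PROFIT_BOUNDS):
    """True if a candidate's objective values fall inside the targeted region."""
    return (share_bounds[0] <= share <= share_bounds[1] and
            profit_bounds[0] <= profit <= profit_bounds[1])

def penalized_objectives(share, profit, commonality, penalty=1e3):
    """Objectives to maximize; out-of-region candidates are heavily penalized so the
    search is pulled toward the region of attraction."""
    if in_attractor(share, profit):
        return (share, profit, commonality)
    return (share - penalty, profit - penalty, commonality - penalty)

print(in_attractor(95.991, 1.118))  # True: the solution later reported in Table 9
```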
The results of this optimization are shown in Figure 13. The glyph plot shows that the MOGA
was able to find product line solutions that existed “below” the currently selected design. In this
figure, the size of the boxes scales with the value of commonality index. Large boxes represent a
CI value closer to 1, while smaller boxes represent a CI value closer to 0. The manufacturer now
explores these new solutions to see if one of them better meets the business goals than the
previously selected design configurations. Finding a solution more in line with the desired
business outcomes—with increased profit and commonality at a small share of preference
penalty—the manufacturer selects this new solution, detailed in Table 9.
Figure 13. Glyph plot showing additional solutions found using an attractor
Table 9. Optimal product line when maximizing profit

Share of preference: 95.991%
Profit: $1.118 million
Commonality index: 0.714286

Product configurations:
Product 1: Photo, video, and hi-res camera; Web and App; Dial; 1.5 in diag; 16 GB; Silver; Custom pattern and graphic overlay; $166
Product 2: Photo, video, and hi-res camera; Web and App; Dial; 4.5 in diag; 16 GB; Silver; Custom graphic overlay; $207.25
Product 3: Photo, video, and hi-res camera; Web and App; Touchscreen; 4.5 in diag; 16 GB; Black; Custom pattern and graphic overlay; $233.5
Product 4: Photo, video, and hi-res camera; Web, App, and Ped; Touchscreen; 4.5 in diag; 160 GB; Black; Custom graphic overlay; $391
The shaded cells in Table 9 show where this selected solution differs in product configuration
from the solution selected in Table 8. The first two products are identical. The third product
removes the pedometer, only offering web and app access. This reduces product price by $7.50.
The most significant changes come in the fourth product. Here, the pedometer is added on, the
screen size is reduced (to 4.5 inches from 6.5 inches), storage size is increased (160 GB from 32
GB), and color is changed to black from silver. Finally satisfied with the product solution found
after leveraging market information to evaluate the solutions and using multidimensional
visualization to explore the space, the manufacturer locks down the solution and begins
production.
5. CONCLUSIONS AND FUTURE WORK
Multi-objective optimization algorithms and multidimensional visualization tools have been
developed and used by the engineering design community over the last 25 years. The objective
of this paper was to demonstrate how these tools and technologies could be extended to market-driven product searches. Toward this goal, Section 1 introduced a case study problem centered
around the design of an MP3 player product line. Using part-worths estimated from Sawtooth
Software’s CBC/HB module, the Advanced Search Module within SMRT was used to optimize
product lines around a single objective. Results from these optimizations showed that the two
objectives considered—share of preference and profit—were in conflict. More importantly, no
information was available about the tradeoff that existed between these objectives except for
knowledge about the two solutions that comprised the “endpoints” of the solution space.
Section 2 introduced the weighted sum approach—an easy and extremely common
approach—to solving problems with multiple objectives. However, the many limitations of this
approach were also highlighted. Significant limitations are the inability to generate solutions in
the non-convex region of the non-dominated set, the inability to create solutions with an even
spread in the performance space, and the computational expense associated with finding many
non-dominated solutions. To address these shortcomings, the foundational background for a
popular multi-objective genetic algorithm—NSGA-II—was presented in Section 2.2. This
algorithm addresses many of the limitations of the weighted sum approach while simultaneously
encouraging solution diversity and maintaining elitism.
A solution to the multi-objective problem was introduced in Section 3 using the multi-objective genetic algorithm. This led to a discussion in Section 4 of the advantages of conducting a
product search using a multi-objective problem formulation. An increased number of solutions—
from 2 to 71—provided enormous amounts of additional insight into the problem. It was shown
how the solution set of 71 product lines was comprised of only 47 unique products, and that there
were significant opportunities for enforcing commonality or eliminating attribute levels with few
negative ramifications. Retaining dominated designs from the final population
was also shown to have merit when unmodeled criteria were considered. Here, commonality
within the product line could be increased by selecting a dominated design with minimal losses
to the business objectives while staying in the desired region of the performance space.
Section 4 also highlighted how exploration of the solution space could be enabled by
interactive multidimensional visualization tools. This technology enables the user to direct
attention to specific regions of interest in the performance space while also accommodating
direct comparisons between non-dominated solutions. Research challenges arising from this
work include understanding the progression of the non-dominated solutions in the performance
space as new technologies, product attributes, and competition are introduced. Further, there is a
need to understand how the selected product design strategy changes over time. A manufacturer
might initially adopt a product strategy designed to capture a large initial market foothold. Over
time, however, the goals of the manufacturer may change to focusing on profit while maintaining
some aspect of market share. Future work could look to model and capture this information.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge support from the National Science Foundation through
NSF CAREER Grant No. CMMI-1054208 and NSF Grant CMMI-0969961. Any opinions,
findings, and conclusions presented in this paper are those of the authors and do not necessarily
reflect the views of the National Science Foundation.
Scott Ferguson
REFERENCES
[1] Sawtooth Software, 2008, “CBC v.6.0,” Sawtooth Software, Inc., Sequim, WA,
http://www.sawtoothsoftware.com/download/techpap/cbctech.pdf.
[2] Sawtooth Software, 2009, “The CBC/HB System for Hierarchical Bayes Estimation Version
5.0 Technical Paper,” Sawtooth Software, Inc., Sequim, WA,
http://www.sawtoothsoftware.com/download/techpap/hbtech.pdf.
[3] Sawtooth Software, 2003, “Advanced Simulation Module for Product Optimization v1.5
Technical Paper,” Sequim, WA.
[4] Pareto, V., 1906, Manuale di Economia Politica, Società Editrice Libraria, Milan, Italy;
translated into English by A. S. Schwier, as Manual of Political Economy, Macmillan, New
York, 1971.
[5] Tappeta, R. V., and Renaud, J. E., 1997, “Multiobjective Collaborative Optimization,”
Journal of Mechanical Design, 119(3): 403–412.
[6] Coello, C. A. C., and Christiansen, A. D., 1999, “MOSES: A Multiobjective Optimization
Tool for Engineering Design,” Engineering Optimization, 31(3): 337–368.
[7] Narayanan, S., and Azarm, S., 1999, “On Improving Multiobjective Genetic Algorithms for
Design Optimization,” Structural Optimization, 18(2–3): 146–155.
[8] Marler, R. T., and Arora, J. S., 2004, “Survey of Multi-Objective Optimization Methods for
Engineering,” Structural and Multidisciplinary Optimization, 26(6): 369–395.
[9] Mattson, C., and Messac, A., 2005, “Pareto Frontier Based Concept Selection Under
Uncertainty, With Visualization,” Optimization and Engineering, 6(1): 85-115.
[10] Gurnani, A., Ferguson, S., Donndelinger, J., and Lewis, K., 2005, “A Constraint-Based
Approach to Feasibility Assessment in Conceptual Design,” Artificial Intelligence for
Engineering Design, Analysis and Manufacturing, Special Issue on Constraints and Design,
20(4): 351-367.
[11] Stump, G., Yukish, M., Martin, J., and Simpson, T., 2004, “The ARL Trade Space
Visualizer: An Engineering Decision-Making Tool,” 10th AIAA/ISSMO Multidisciplinary
Analysis and Optimization Conference, Albany, NY, AIAA-2004-4568.
[12] Stump, G. M., Lego, S., Yukish, M., Simpson, T. W., and Donndelinger, J. A., 2009,
“Visual Steering Commands for Trade Space Exploration: User-Guided Sampling with
Example,” Journal of Computing and Information Science in Engineering, 9(4): 044501:1–
10.
[13] Daskilewicz, M. J., and German, B. J., 2011, "Rave: A Computational Framework to
Facilitate Research in Design Decision Support,” Journal of Computing and Information
Science in Engineering, 12(2): 021005:1–9.
[14] Eckenrode, R. T., 1965, “Weighting Multiple Criteria,” Management Science, 12: 180–
192.
[15] Arora, J. S., 2004, Introduction to Optimum Design—2nd edition, Academic Press.
[16] Rao, S., 2009, Engineering Optimization: Theory and Practice—4th edition, Wiley.
[17] Athan, T. W., and Papalambros, P. Y., 1996, “A Note on Weighted Criteria Methods for
Compromise Solutions in Multi-objective Optimization,” Engineering Optimization, 27:
155–176.
[18] Das, I., and Dennis, J. E., 1997, “A Closer Look at Drawbacks of Minimizing Weighted
Sums of Objectives for Pareto Set Generation in Multicriteria Optimization Problems,”
Structural Optimization, 14: 63–69.
[19] Chen, W., Wiecek, M. M., and Zhang, J., 1999, “Quality Utility—A Compromise
Programming Approach to Robust Design,” Journal of Mechanical Design, 121: 179–187.
[20] Messac, A., and Mattson, C. A., 2002, “Generating Well-distributed Sets of Pareto Points
for Engineering Design Using Physical Programming,” Engineering Optimization, 3: 431–
450.
[21] Marler, R. T., and Arora, J. S., 2010, “The Weighted Sum Method for Multi-objective
Optimization: New Insights,” Structural and Multidisciplinary Optimization, 41(6): 853–62.
[22] Kim, I. Y., and de Weck, O. L., 2005, “Adaptive Weighted-Sum Method for Bi-objective
Optimization: Pareto Front Generation,” Structural and Multidisciplinary Optimization,
29(2): 149–58.
[23] Fonseca, C. M., and Fleming, P. J., 1998, “Multiobjective Optimization and Multiple
Constraint Handling with Evolutionary Algorithms—Part I: A Unified Formulation,” IEEE
Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 28(1): 38–47.
[24] Zitzler, E. and Thiele, L., 1999, “Multi-objective Evolutionary Algorithms: A
Comparative Case Study and the Strength Pareto Approach,” IEEE Transactions on
Evolutionary Computation, 3(4): 257–271.
[25] Zitzler, E., Deb, K., and Thiele, L., 2000, "Comparison of Multi-objective Evolutionary
Algorithms: Empirical Results,” Evolutionary Computation, 8(2): 173–195.
[26] Deb, K., Pratap, A., Agarwal, S., and Meyarivan, T., 2002, “A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II,” IEEE Transactions on Evolutionary Computation,
6(2): 182–197.
[27] Coello, C. A. C., Pulido, G. T., and Lechuga, M. S., 2004, “Handling Multiple Objectives
with Particle Swarm Optimization,” IEEE Transactions on Evolutionary Computation, 8(3):
256–279.
[28] Zhang, Q., Li, H., 2007, “MOEA/D: A Multi-objective Evolutionary Algorithm Based on
Decomposition,” IEEE Transactions on Evolutionary Computation, 11(6): 712–731.
[29] Bandyopadhyay, S., Saha, S., Maulik, U., and Deb, K., 2008, "A Simulated Annealing-based Multi-objective Optimization Algorithm: AMOSA," IEEE Transactions on
Evolutionary Computation, 12(3): 269–283.
[30] Hadka, D., and Reed, P., 2013, "Borg: An Auto-Adaptive Many-Objective Evolutionary
Computing Framework," Evolutionary Computation, 21(2): 231–259.
[31] Balling, R., 1999, “Design by Shopping: A New Paradigm,” Proceedings of the Third
World Congress of Structural and Multidisciplinary Optimization, 295–297.
[32] Michalewicz, Z., and Schoenauer, M., 1996, “Evolutionary Algorithms for Constrained
Parameter Optimization Problems,” Evolutionary Computation, 4(1): 1–32.
[33] Coello, C. A. C., 2006, “Evolutionary Multi-objective Optimization: A Historical View of
the Field,” Computational Intelligence Magazine, IEEE, 1(1): 28–36.
[34] Coello, C. A. C., Lamont, G. B., and Van Veldhuisen, D. A., 2007, Evolutionary
Algorithms for Solving Multi-objective Problems, Springer, NY.
[35] Turner, C., Foster, G., Ferguson, S., Donndelinger, J., and Beltramo, M., 2012, “Creating
Targeted Initial Populations for Genetic Product Searches,” 2012 Sawtooth Software Users
Conference, Orlando, FL.
[36] Turner, C., Foster, G., Ferguson, S., Donndelinger, J., “Creating Targeted Initial
Populations for Genetic Product Searches in Heterogeneous Markets,” Engineering
Optimization.
[37] Foster, G., and Ferguson, S., “Enhanced Targeted Initial Populations for Multi-objective
Product Line Optimization,” Proceedings of the ASME 2013 International Design
Engineering Technical Conference & Computers and Information in Engineering
Conference, Design Automation Conference, Portland, OR, DETC2013-13303.
[38] Matlab, The MathWorks.
[39] Simpson, T. W., Siddique, Z, and Jiao, R. J., 2006, Product Platform and Product Family
Design: Methods and Applications, Springer.
[40] Martin, M. V., and Ishii, K., 1996, “Design for Variety: A Methodology for
Understanding the Costs of Product Proliferation,” Proceedings of the 1996 ASME Design
Engineering Technical Conferences, Irvine, CA, DTM-1610.
[41] Martin, M. V., and Ishii, K., 1997, “Design for Variety: Development of Complexity
Indices and Design Charts,” Proceedings of the 1997 ASME Design Engineering Technical
Conferences, Sacramento, CA, DFM-4359.
[42] Ferguson, S., Gurnani, A., Donndelinger, J., and Lewis, K., 2005, “A Study of
Convergence and Mapping in Preliminary Vehicle Design,” International Journal of Vehicle
Systems Modeling and Testing, 1(1/2/3): 192–215.
A SIMULATION BASED EVALUATION OF THE PROPERTIES
OF ANCHORED MAXDIFF:
STRENGTHS, LIMITATIONS AND RECOMMENDATIONS FOR PRACTICE
JAKE LEE
MARITZ RESEARCH
JEFFREY P. DOTSON
BRIGHAM YOUNG UNIVERSITY
INTRODUCTION
Over the past few years MaxDiff has emerged as an efficient way to elicit a rank ordering
over a set of attributes. In its essence, MaxDiff is a specific type of discrete choice exercise where
respondents are asked to identify a subset of items from a list that they feel are the most and least
preferred, as they relate to some decision of interest. Through repeated choices, researchers can
infer the relative preference of these items, thus providing a prioritized list of actions a firm
could take to improve operations.
A well-known limitation of a standard MaxDiff study is that it does not allow for
identification of a preference threshold, or the point that differentiates important from
unimportant items. MaxDiff allows us to infer the relative preference of the items tested in the
study, but cannot determine which of these items the respondent believes the company should
actually change (i.e., which items would have a meaningful impact on their choice behavior). It
is possible that respondents with vastly different preference thresholds can manifest similar
response patterns, thus resulting in similar utility scores across the set of items. Without a
preference threshold, a respondent who thinks all options are in need of improvement could be
indistinguishable from a respondent that believes that nothing is in need of improvement.
Anchored MaxDiff has emerged as a solution to this particular problem. Three anchored
MaxDiff techniques have been proposed to resolve this issue through the introduction of a
respondent-specific threshold (or anchor) within or in conjunction with the MaxDiff exercise.
These approaches are the Indirect (Dual Response) method (Louviere, Orme), the Direct
Approach (Lattery), and the Status-Quo Approach (Chrzan and Lee). Both the Indirect and Direct
approaches have been examined in prior Sawtooth Software Conference Proceedings and white
papers. While the Status-Quo approach has not been formally studied, it has been mentioned in
the same sources.
In this paper we examine both the theoretical and empirical properties of each of the
proposed anchored MaxDiff techniques. This is accomplished through the use of a series of
simulation studies. By simulating data from a process where we know the truth, we can contrast
the ability of each approach to recover the true rank order of items and their relation to the
preference threshold. Through this research we hope to identify when and under what
circumstances each approach is likely to prove most effective, thus allowing us to provide
practical advice to the practitioner community.
THEORETICAL FOUNDATIONS
Our approach to studying the properties of the various anchored MaxDiff approaches is built
upon the idea that all discrete outcomes can be characterized as the (truncated) realization of an
unobserved continuous process. In the case of choice data generated through a random utility
model, it is believed that there exists a latent (continuous) variable called utility that allows the
respondent to define a preference ordering over alternatives in a choice set. The utility for each
alternative in a choice set is assumed to be known by the respondent at the time of choice.
Information about utility is revealed to the researcher as the respondent reacts to various choice
sets (e.g., picks the best, picks the best and worst, rank orders, etc.).
This process is illustrated in Figure 1 where a hypothetical respondent reacts to a set of 4
alternatives in a choice set. By selecting alternative B as the best, the respondent provides
information to the researcher about the relative rank ordering of latent utility. Specifically, we
know from this choice that the utility for alternative B is greater than the utility for alternatives
A, C, and D. No information, however, is provided to the researcher about the relative ranking of
utility for the non-selected alternatives.
Figure 1
Implied Latent Utility Structure for a "Pick the Best" Choice Task
Choice task: alternatives A, B, C, and D, with B selected as best.
Implied latent utility ordering: UD <> UC <> UA < UB
In the case of a “Best-Worst” choice task the respondent provides information about the
relative utility of two items in the choice set (i.e., the most and least preferred). In the example
provided in Figure 2, the respondent identifies alternative B as the best and alternative C as the
worst. As such, we know that alternative B is associated with the greatest level of utility and
alternative C is associated with the lowest level of utility. No information is provided about the
relative attractiveness of alternatives A and D. An argument in favor of MaxDiff analysis is that it
economizes respondent time and effort by extracting more information from a given choice set
than would be obtained by having the respondent simply pick the best. Presumably, it is easier to
evaluate a single choice set and pick the best and worst alternatives than it would be to pick the
best alternative from two separate choice sets.
Figure 2
Implied Latent Utility Structure for a "Best-Worst" Choice Task
Choice task: alternatives A, B, C, and D, with B selected as best and C as worst.
Implied latent utility ordering: UC < UD <> UA < UB
Figure 3 extends the analysis in Figure 2 by introducing a preference threshold. In this
example the respondent is asked to pick the best and worst alternatives, thus informing us about
their relative latent utility. In a follow-up question, the respondent is asked if any of these
these items exceed their preference threshold. This is an example of the indirect or dual-response
approach to anchored MaxDiff. As illustrated in Figure 3, an answer of “no” informs the
researcher that the latent utility for the outside good (i.e., the preference threshold) is greater than
the utility for all of the alternatives within the choice set. This is extremely useful information as
it tells us that even though alternative B is most preferred, it is only the best option of an
unattractive set of options. Investing in alternative B would not lead to a meaningful change in
the respondent’s behavior.
Figure 3
Implied Latent Utility Structure for a "Best-Worst" Choice Task with a Dual Response Anchor
Choice task: alternatives A, B, C, and D plus a dual-response anchor question (importance threshold), answered "no."
Implied latent utility ordering: the anchor exceeds the latent utility of all four alternatives.
APPROACHES TO ANCHORED MAXDIFF
As discussed above, three approaches have been proposed to provide a preference threshold or
anchor for MaxDiff studies:
The Indirect Approach
The Indirect (or dual response) Approach to anchored MaxDiff involves the use of a series of
follow-up questions for each choice task. An example of this style of choice task is presented in
Figure 4. In each choice task, respondents are first asked to complete a standard MaxDiff
exercise. Following selection of the best and worst options, they are asked to identify which of
the following is true: (1) All of these features would enhance my experience, (2) None of these
features would enhance my experience, or (3) Some of these features would enhance my
experience.
The subject’s response to this question informs us about the relative location of the
preference threshold in the latent utility space. Selection of option (1) tells us that the utility for
the anchor is less than the utility for all presented features, whereas selection of option (2) tells
us that the latent utility for the anchor is greater than the utility for the presented features. By
choosing option (3) we know that the latent utility of the anchor falls somewhere between the
best and worst features.
Figure 4
Example Choice Task with an Indirect (Dual Response) Anchor

Thinking of your restaurant visit, which of these features, if improved, would most/least enhance your experience? (Most Preferred / Least Preferred)
Have the restaurant be cleaner
Have the server stop by more often
Have the server stop by less often
Have more choices on the menu

Considering just the 4 features above, which of the following best describes your opinion about enhancing your experience?
All 4 of these features would enhance my experience
None of these features would enhance my experience
Some of these features would enhance my experience, some would not

Implied latent utility ordering: Anchor < UC < UD <> UB < UA
The Direct Approach
An illustration of the Direct Approach to anchored MaxDiff is presented in Figure 5. In the
Direct Approach, respondents first complete a standard MaxDiff study. Upon conclusion, they
are given a follow-up task wherein they are asked to evaluate the relative preference (relative to
the defined preference threshold) for each of the features in the study. In the example below,
respondents are asked to identify which of the features in the study would have a “very strong
positive impact on their experience.” If the respondent selects the first option from the list and
none of the other options, we would know that the latent utility of that option is greater than the
utility of the anchor, and that the utilities of the remaining features are less than the anchor. In
general, an item that is selected in this task has utility in excess of the anchor and items not
selected have utility below the anchor.
Figure 5
Example Choice Task with the Direct Anchor
Please tell us which of the features below would have a very strong positive impact on your experience at the restaurant.
(Check all that apply)
Be greeted more promptly
Have the restaurant be cleaner
Change the décor in the restaurant
Have the server stop by more often
Have the server stop by less often
Have the meal served more slowly
Have the meal served more quickly
Receive the check more quickly
Have the server wait longer to deliver the check
Have lower priced menu items
Have fewer kinds of food on the menu
Have more choices on the menu
None of these would have a strong positive impact on my experience
… <> UD <> UC <> UB < Anchor < UA
The Status-Quo Approach
The Status-Quo approach is implemented by incorporating a preference anchor directly into
the study by including it as an attribute in the experimental design. This is illustrated in Figure 6
where the anchor is specified as “No changes—leave everything as it was.” If the option
corresponding to the anchor is selected, we know that the latent utility for the anchor exceeds the
latent utility for all other alternatives in the choice task. If the anchor is selected as the least
preferred attribute, we know that its utility is less than the utility for the other features. Finally, if
the anchor is not selected we know that its utility falls somewhere in between the most and least
preferred features.
Figure 6
Example Choice Task with a Status-Quo Anchor
Thinking of your restaurant visit, which attribute if improved would most/least
enhance your experience?
Most Preferred
Least Preferred
Have the restaurant be cleaner
Have the server stop by more often
Have the server stop by less often
Have more choices on the menu
No changes - leave everything as it was
UC < UD <> UB <> UA < Anchor
SIMULATION STUDY
We contrast the performance of each of the proposed anchored MaxDiff approaches using
synthetic choice data (i.e., data where the true latent utility is known). We strongly prefer the use
of simulated data for this exercise for a few reasons. First, with real respondents we wouldn’t
know their true preference structure and would be left to make comparisons based on model fit
rather than on the ability of the approach to recover the true underlying preferences. Second, it allows us to
abstract away from framing differences in the execution of the anchored MaxDiff approaches
described above. It would be exceptionally difficult to frame the questions in each of these
approaches in such a way that they would be consistently interpreted by subjects. Simulation
allows us to assume that respondents are both rational and fully informed when completing these
exercises. Also, we avoid psychological effects such as task ordering and respondent
fatigue across 3 full exercises.
Data for our study are simulated using the following procedure:
1. Simulate the (continuous) latent utility for each respondent and each alternative in a
choice task according to the standard random utility model: U_ij = X_ij'β_i + ε_ij, where β_i is a
pre-specified vector of preference parameters for a given respondent and ε_ij is the random
component of utility, drawn from a Gumbel distribution.
2. Each simulated respondent is then presented with a set of experimental choice exercises
where they provide best, worst, and anchor responses for each of the three proposed
anchored MaxDiff approaches. It is important to note that given a realization of utility
within a choice set, we can use the same data to populate responses for all of the
anchored MaxDiff approaches. In other words, the same data generating mechanism
underlies all three approaches. They differ only in terms of how that information is
manifest through the subject’s response behavior.
3. Data generated from this simulation exercise is then modeled using an HB MNL model to
recover the latent utility structure for each simulated respondent.
4. These estimated utilities are then compared with the simulated or true latent utilities, thus
allowing us to assess performance of each of the proposed methods.
The procedure described above was repeated under a variety of settings where we modify the
number of attributes, location of preference (e.g., all good, all bad, or mixed), and respondent
consistency (i.e., error scale).
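A compact sketch of steps 1 and 2 for a single choice set is shown below (Python/NumPy). The part-worths, the set composition, and the convention of fixing the anchor utility at zero are illustrative assumptions, and the Direct response is shown per set here even though the actual method asks it once over all items.

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_choice_set(beta, item_ids, anchor_utility=0.0, scale=1.0):
    """Draw latent utilities U = beta + Gumbel error, then derive the best/worst
    picks and the three anchor responses for one choice set."""
    u = beta[item_ids] + rng.gumbel(scale=scale, size=len(item_ids))
    best, worst = item_ids[np.argmax(u)], item_ids[np.argmin(u)]

    # Indirect (dual response): all / none / some of the shown items exceed the anchor.
    above = u > anchor_utility
    indirect = "all" if above.all() else ("none" if not above.any() else "some")

    # Direct: which shown items would be checked as exceeding the anchor
    # (in the actual method this question is asked once over the full item list).
    direct = [i for i, flag in zip(item_ids, above) if flag]

    # Status-quo: the anchor competes as an extra alternative inside the task.
    utilities_with_anchor = np.append(u, anchor_utility)
    status_quo_best_is_anchor = np.argmax(utilities_with_anchor) == len(item_ids)

    return best, worst, indirect, direct, status_quo_best_is_anchor

beta = np.array([1.2, 0.4, -0.3, -1.0, 0.8])   # illustrative true part-worths
print(simulate_choice_set(beta, item_ids=np.array([0, 2, 3, 4]), scale=1.0))
```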
Please note that for our simulation we coded the Direct Method to be consistent with the
Sawtooth Software procedure. That is, for each attribute an additional task is included in the
design matrix pitting the attribute against the anchor with the choice indicating if the box was
selected or not. An alternative specification would be to add just two additional tasks to the
design matrix to identify which attributes are above and below the threshold (Lattery).
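A rough sketch of that augmentation follows; the dictionary-based task representation and the coding of the anchor as an extra alternative are our own illustrative choices, not Sawtooth Software's actual data format.

```python
def augment_direct_tasks(n_items, checked_items, anchor_id=None):
    """Append one binary task per item, pitting that item against the anchor.
    The anchor is coded as an extra 'item' (id = n_items by default), and the
    recorded choice indicates whether the item's box was checked."""
    anchor = n_items if anchor_id is None else anchor_id
    tasks = []
    for item in range(n_items):
        choice = 0 if item in checked_items else 1   # 0 -> item chosen, 1 -> anchor chosen
        tasks.append({"alternatives": [item, anchor], "choice": choice})
    return tasks

# Example: 5 items, respondent checked items 0 and 3 in the direct follow-up.
for task in augment_direct_tasks(5, checked_items={0, 3}):
    print(task)
```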
SIMULATION STUDY RESULTS
We examine 2 objective measures of model performance in order to contrast the results of
each variant of Anchored MaxDiff across the various permutations of the simulation. The first
measure examined is threshold classification. Threshold classification is a measure of how well
the model correctly classifies each of the attributes with respect to the anchor/threshold. It is
reported as the percent of attributes that are misclassified. The second measure of performance is
how well the approach is able to recover the true rank ordering of the attributes. This is reported
as the percent of items that are incorrectly ranked.
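A small sketch of these two measures is given below, under the assumptions that the anchor/threshold sits at zero utility and that "percent missed" is read as the share of items whose estimated rank differs from their true rank; both assumptions are ours, made for illustration.

```python
import numpy as np

def threshold_misclassification(true_u, est_u, threshold=0.0):
    """Percent of items whose estimated side of the threshold disagrees with the truth."""
    true_u, est_u = np.asarray(true_u), np.asarray(est_u)
    return 100.0 * np.mean((true_u > threshold) != (est_u > threshold))

def rank_order_misses(true_u, est_u):
    """Percent of items whose estimated rank differs from their true rank."""
    true_rank = np.argsort(np.argsort(-np.asarray(true_u)))
    est_rank = np.argsort(np.argsort(-np.asarray(est_u)))
    return 100.0 * np.mean(true_rank != est_rank)

true_u = [1.4, 0.6, -0.2, -1.1]
est_u = [1.1, 0.7, 0.1, -0.9]
print(threshold_misclassification(true_u, est_u))  # 25.0: one item crosses the threshold
print(rank_order_misses(true_u, est_u))            # 0.0: the ordering is preserved
```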
Model results are presented below for each of the experimentally manipulated settings:
Number of Attributes
In this first test we manipulate the number of attributes tested in a given design. We examine
three conditions, Low, Medium, and High. The number of tasks was held constant for each
condition resulting in an increasing attribute to task ratio. Results of this test are provided in
Figure 7. We observe that with either a medium or low number of attributes all three approaches
perform equally well. However, in the High condition we observe that the Direct Approach
seems to outperform both the Indirect and Status-Quo approaches, with the Status-Quo Approach
performing slightly better than the Indirect Approach.
We believe that the superiority of the Direct Approach in the case of many attributes can be
explained by examining the structure of the follow-up task. In the case of the Direct Approach,
respondents evaluate each item with respect to the anchor following completion of the MaxDiff
exercise. If there are 50 items being tested in the study, this requires completion of an additional
50 questions. In the case of the Status-Quo and Indirect approaches, information about the
preference of items relative to the anchor is captured through the MaxDiff or follow-up question.
For studies with many attributes and relatively few choice tasks, this may not provide much
information about the relationship between a given attribute and the anchor. As such, the Direct
Approach should be more informative about the relative rank ordering and preference of items.
Figure 7
Number of Attributes
Threshold Classification (Percent Misclassified) and Rank Ordering (Percent Missed), shown for the Indirect, Direct, and Status-Quo methods across the Low, Medium, and High attribute conditions.
Preference Location
In this second test, we manipulate the location of preference parameters relative to the
anchor. In condition 1, all simulated parameters are lower than the preference threshold (i.e.,
nothing should be changed). In condition 2 we set all simulated parameters above the preference
threshold. In the 3rd and final condition we allow the parameters to be mixed with respect to the
threshold (i.e., some items are preferred over the threshold and some others are not).
Results for this second test are presented in Figure 8. In terms of recovery of threshold
classification, as expected, all three approaches perform exceptionally well when the preference
parameters are all either above or below the threshold. When valuations are mixed with respect
to the anchor, the Indirect Method does a slightly better job at recovering parameters. The
Indirect Method can be the most informative approach, as it includes a follow-up question in each
choice task, but only if the middle anchor option is not used too often.
When valuations are all above or below the threshold, it appears as though the Status-Quo
method was less able to recover the true rank ordering of items. This can be explained by
observing that under this condition the Status-Quo method should be much less informative
about the rank ordering of items. As the Status-Quo option appears as an alternative in each
choice task, it will always be selected when valuations for the remaining items are either all good
or all bad (relative to the anchor). As such, we collect less information about the latent utility of
the tested items than we would with mixed valuations. An alternative design with the Status-Quo
option appearing in a subset of the scenarios (Sawtooth Software MaxDiff documentation) would
likely improve the rank order recovery for this method.
This finding implies that if we expect all of the items to be either all good or all bad (relative
to the anchor), we should not use the Status-Quo approach for Anchored MaxDiff or have the
threshold level appear in a subset of the tasks. That said, if we have strong knowledge of the
relative utility of the items with respect to the anchor, we do not need to use Anchored MaxDiff
in the first place.
Figure 8
Preference Location
Threshold Classification (Percent Misclassified) and Rank Ordering (Percent Missed), shown for the Indirect, Direct, and Status-Quo methods across the All Bad, All Good, and Mixed conditions.
Respondent Consistency
Respondent consistency is manipulated by increasing or decreasing the average attribute
utility relative to the size of the logit error scale. In the high-consistency condition, the average size
of the preference parameters is large, thus allowing respondents to be very deterministic in their
choices. In the low-consistency condition, the average size of the simulated parameters is very
low relative to the error scale, thus increasing the perceived randomness in subject responses. For
completeness we include a medium consistency condition.
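A tiny sketch of this manipulation is shown below; the specific scale multipliers are arbitrary values chosen for illustration, not the settings used in the study.

```python
import numpy as np

rng = np.random.default_rng(11)

def draw_utilities(beta, consistency):
    """Scale the true part-worths relative to a fixed Gumbel error to mimic
    high, medium, and low respondent consistency."""
    scale = {"high": 3.0, "medium": 1.0, "low": 0.3}[consistency]  # illustrative values
    return scale * np.asarray(beta) + rng.gumbel(size=len(beta))

beta = [1.0, 0.2, -0.5, -1.2]
print(draw_utilities(beta, "low"))   # choices driven mostly by error
print(draw_utilities(beta, "high"))  # choices nearly deterministic
```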
Results for this exercise are presented in Figure 9. When respondents are consistent in their
choices (i.e., the high consistency condition) all three approaches perform well in both relative
and absolute terms. However, in the medium- and low-consistency conditions the Direct Method
performs much worse than either the Indirect or Status-Quo approaches.
It is our belief that this is a direct result of the design of the Direct Method. In the Direct
Method all information about the relative preference of an attribute with respect to the anchor is
captured through a single direct binary question presented at the end of the survey. If respondents
are inconsistent or error prone (as is the case in the medium and low conditions), a mistake in
this last set of questions will have serious implications for our ability to recover the true
threshold classification or rank ordering. In the Indirect and Status-Quo approaches, each item
has the potential to be classified relative to the anchor multiple times. As such, any single error
on the part of a respondent will not have as large of an impact on parameter estimation.
This result implies that all three approaches are likely to work well if respondents are
consistent in their choices. However, if respondents are inconsistent in their choices the Direct
Method is likely to perform much worse than either the Indirect or Status-Quo approaches. This
should be an important consideration in selection of an Anchored MaxDiff approach as it is
difficult to forecast respondent consistency prior to the execution of a study.
Figure 9
Respondent Consistency
Threshold Classification (Percent Misclassified) and Rank Ordering (Percent Missed), shown for the Indirect, Direct, and Status-Quo methods across the High, Medium, and Low consistency conditions.
CONCLUSION
Taken collectively, our analysis implies the following:
• Under "regular" circumstances all 3 anchoring techniques perform well. If the anchor location is not at an extreme, respondent consistency is high, and the number of attributes being tested is low, all 3 techniques are virtually indistinguishable in the simulation results.
• All else being equal, the Status-Quo anchoring technique is the easiest to understand and implement. It is also easier for an analyst, as it does not require additional manipulation of the design matrix prior to estimation.
• The Direct Method should be avoided when respondent consistency is expected to be low.
• As the number of attributes increases, the Direct Method outperforms the other approaches in simulation. However, implementation of this method with many attributes will substantially increase respondent burden.
LIMITATIONS AND AREAS FOR FUTURE RESEARCH
Error rate in the anchor judgments
We assumed zero error in the anchor judgments for all of the simulations. Using different
rules for each approach to set the anchor error rate is not a good option as the assumptions used
will likely determine the results. For each test we systematically controlled the error for the
choice exercises to see if the approaches differed in accuracy with varying amounts of error.
However, since the anchor judgments are completely different tasks, there is not a way to
systematically vary the same amount of error across the 3 approaches. A novel approach to fairly
assign error across the anchor tasks would benefit future research.
Appearance rate of the Status-Quo option
In our simulation of the Status-Quo method, the anchor was included in each of the choice
tasks. However, the Status-Quo method is often performed by incorporating the anchor as one of
many attributes in the experimental design, which leads to its inclusion in a smaller subset of the
tasks. It will be interesting to see how well the Status-Quo method does as the frequency with
which the anchor appears varies. Having the anchor appear less often might alleviate some of the
error introduced when most of the items are above or below the anchor. Future research can
easily be adapted to test this hypothesis.
Jake Lee
Jeffrey P. Dotson
BEST-WORST CBC CONJOINT APPLIED TO SCHOOL CHOICE:
SEPARATING ASPIRATION FROM AVERSION
ANGELYN FAIRCHILD
RTI INTERNATIONAL
NAMIKA SAGARA
JOEL HUBER
DUKE UNIVERSITY
WHY USE BW CBC CONJOINT TO STUDY SCHOOL CHOICE?
Increasingly, school districts are allowing parents to have a role in selecting which school
they would like their children to attend. These parents make high-impact decisions based on a
number of complex criteria. In this pilot study, we apply best-worst (BW) CBC conjoint methods
to elicit tradeoff preferences for features of schools. Given sets of four schools defined by eight
attributes, parents indicate the school they would like most and least for their child. We prepare
respondents for these conjoint choices using introductory texts, reflective questions, and practice
questions that gradually increase in complexity. Despite a relatively small sample size (n=147),
we find that Hierarchical Bayes (HB) preference weight estimates are relatively stable when
best-worst choices are estimated either separately or jointly. Further we find that the “best”
responses reflect relatively balanced focus on all attribute levels, while “worst” responses focus
more on negative levels of attributes. Thus, the “best” responses reflect parents’ aspirations while
“worst” responses reflect their aversions.
Historically, most students in the United States were assigned to a school based on their
home address. Choices about schooling were typically a byproduct of housing location in which
the perceived quality of the school was a major factor in a family’s housing choice. Recently, in a
number of areas, including Boston and Charlotte, school districts allow parents a choice in
determining which school their children will attend. The structure of these choice contexts varies
greatly—some districts are expanding options for competitive magnet or charter schools, while
others allow all parents to select their preferred school from a menu of options. These decisions
are complex, high impact, and low frequency, making them, as we will propose, ideal for best-worst analysis. Best-worst analysis can be used to complement the analysis of actual school
choice. While there have been studies of the outcomes of actual school choice programs
(Hastings and Weinstein 2008, Philips, Hausman and Larson 2012), this is, to our knowledge, the
first best-worst CBC analysis of school choice.
Other similarly complex and high-impact decisions, including decisions over surgical options
(Johnson, Hauber et al. 2010), cancer treatments (Bridges, Mohamed et al. 2012), and long term
financial planning (Shu, Zeithammer and Payne, 2013), have been successfully studied using
conjoint methods. For decisions like these, observational data may be weak or unavailable, or
only available retrospectively, making preferences difficult to ascertain. When program planning
and efficacy depend on anticipating choices, choice based conjoint is especially helpful because
it provides an opportunity to understand choice preferences prospectively in a controlled
experimental environment.
We developed a best-worst CBC conjoint survey following a format developed by Ely Dahan
(2012). In our task parents are given eight sets of four school options and asked for each to
indicate the school they would like best and the one they would like least for their child. There
are two advantages of the best-worst (BW) task for our purposes. First, the task generates 16
choices from a display of eight choice sets, and thus has the potential to more efficiently assess
preferences. Second, and more important, it is possible to make separate utility assessments of
the Best and the Worst choices. In the context of school choice, it is possible that both best and
worst choices reflect very similar utilities, differing only by scale, but we will show that in fact
they substantively differ.
We will also demonstrate the predictive value of the conjoint task by positioning the
attributes in a two-dimensional space and projecting characteristics of respondents into that
space. While a two dimensional space does not capture the rich heterogeneity in the relationships
between respondent characteristics and choice preferences, it nevertheless provides insight into
important differences across respondents in their school choices.
Finally, we will show that measures of respondent values can be generated with different
analytical models and that these measures are fairly consistent between models. In particular, we
pit results from a Hierarchical Bayes (HB) estimate against a simple linear probability model in
which preferences for the attribute levels are assumed to be linearly related. The consistency
between the two models suggests that the simpler linear probability estimates could be used on
the fly to provide respondents with feedback on their personal importance weight estimates.
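One way such on-the-fly estimates could be produced is with a per-respondent least-squares fit of best/worst indicators on the attribute coding of the alternatives shown. The sketch below illustrates that general idea only; it is not the authors' model specification, and the design matrix and indices are made up for the example.

```python
import numpy as np

def linear_probability_weights(X, best_idx, worst_idx):
    """Per-respondent OLS sketch: regress a +1 (best) / -1 (worst) / 0 indicator
    on the attribute coding of each alternative shown to the respondent.

    X        : (n_alternatives, n_features) design matrix of attribute codes
    best_idx : indices of alternatives chosen as best
    worst_idx: indices of alternatives chosen as worst
    """
    y = np.zeros(X.shape[0])
    y[list(best_idx)] = 1.0
    y[list(worst_idx)] = -1.0
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

# Toy example: 8 alternatives described by 3 standardized attributes.
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))
print(linear_probability_weights(X, best_idx=[0, 5], worst_idx=[2, 7]))
```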
Below, we describe our survey development and administration in more detail, followed by a
discussion of the methods used to analyze the choice data. We then discuss the results of our
analysis and characterize contexts in which BW conjoint makes sense.
SURVEY DESIGN AND IMPLEMENTATION
The goal of the survey was to assess whether conjoint methods could be used to simulate
real-world choices for schooling options, and to provide meaningful importance weight estimates
based on stated preferences. In school choice programs, parents evaluate schools based on a
broad range of criteria. Attributes included in the study were based on the school characteristics
that were publicly available through official “school report cards,” online school quality ratings,
and school websites. After identifying a list of salient school features, we identified plausible
ranges reflecting actual variation in each attribute. Attributes were not included in the final
choice questions if the actual variance was small. Examples of variables dropped for that reason
include the number of days in school, length of the school day, and teacher credentials. We pretested the survey with a convenience sample of 15 parents in order to assess survey
comprehension and to further narrow a long list of attributes down to the final eight.
The choice scenarios asked respondents, “Suppose that you have just moved to a new area
where families are able to choose which school they would most like their children to attend.”
Parents were instructed to select the school they liked most and the school they liked least from
sets of four in each of the eight choice questions. To acclimate them to the choice scenario and
question format, respondents completed a series of practice questions that increased in difficulty
as new attributes were introduced. The simplest practice choice included only two continuous
attributes and two schools, teaching respondents how to indicate their most and least liked
school. The next practice choice included four continuous attributes and four schools, and
featured one clearly superior profile and one clearly inferior profile as a test for respondent
comprehension. The full BW conjoint questions included four continuous and four binary
attributes similar to that shown in Figure 1.
Figure 1: Sample B-W Choice Task
Respondents were introduced to the survey attributes one at a time. The attributes and levels
included in the final survey are shown in Table 1 and Table 2. For each attribute, we defined the
attribute and listed the possible range of levels. In addition, we asked a series of questions about
the respondent’s experience with that feature at their child’s current school. This reinforcement
encourages respondents to think about each feature and relate the new information to their own
experience. Symbols representing each binary feature were introduced as part of the attribute
descriptions. Respondents were also quizzed to test and reinforce their comprehension of the
symbols, and in our sample 90% of respondents got all of these questions right.
Table 1: Continuous attributes

Attribute: Travel Time
Levels: 5 minutes, 15 minutes, 30 minutes, 45 minutes
Description: One way that the schools will differ for you is how long it will take for your child to get to school by public school bus. All schools have bus transport from your home, but the bus ride can take from 5 to 45 minutes. Of course your child does not have to take the bus, but for these choices you can assume that driving will take approximately the same time.

Attribute: Academic: Percent under grade level
Levels: 15%, 25%, 35%, 45%
Description: The schools also differ in terms of academic quality. Every year, students in public schools take tests that measure if their skills are sufficient for their grade level. If students' scores are not high enough on these tests they are considered below grade level. Every school has some students who are below grade level. Among the schools in your new area, schools could have as few as 15% (15 out of 100) and as many as 45% (45 out of 100) of students below grade level.

Attribute: Economic: Percent economically disadvantaged
Levels: 10%, 30%, 50%, 70%
Description: The schools in your new area also differ in the percent of students that are economically disadvantaged. Often children from low income families can get help paying for things like school lunches, after school programs, and school fees. The percent of students at a school who are economically disadvantaged is different in every school and could range from 10% to 70% of students (10 to 70 out of every 100 students).

Attribute: Percent minority
Levels: 25%, 40%, 55%, 70%
Description: The schools in your new area also differ in racial diversity. Students come from many racial or ethnic backgrounds. The percent of students who are defined as minorities (African American, Hispanic/Latino, American Indian or people from other racial or ethnic backgrounds besides Caucasian or White) ranges from 25% to 70% (25 to 70 out of 100 students).
Table 2: Binary features

Attribute: Promote sports teams
Description: All of the schools in your new area have physical education classes, where children play sports and exercise with the rest of their class. Some schools encourage students to join sports teams. When students join sports teams, they practice every day after school with other students their age, and play games against other schools. In your new area, students at schools with sports teams can choose to play various sports, such as basketball, volleyball, wrestling, baseball, track and field, football, or soccer. School sports teams do not cost any money, but students on sports teams may not be able to ride the bus home and often need a ride home from practice and games in the evening.

Attribute: International Baccalaureate (IB) program
Description: Some schools offer the International Baccalaureate (IB) program, which connects students all around the world with a shared global curriculum. The IB program emphasizes intellectual challenge, encouraging students to make connections between their studies in traditional subjects and to the real world. It fosters the development of skills for communication, intercultural understanding and global engagement, qualities that are essential for life in the 21st century.

Attribute: Science, Technology, Engineering, and Math (S.T.E.M.)
Description: Some schools offer special classes in science, technology, engineering, and math (S.T.E.M.). Students in these classes get extra practice with science and math, working on computers, and doing projects that use science and math skills. These classes help students to be more prepared for advanced classes, college, and jobs.

Attribute: Expanded Arts Program
Description: Some schools have an expanded arts program that allows children to practice their artistic abilities. When schools have expanded arts programs, students can choose to take classes like theater, dance, choir, band, painting, drawing, or ceramics. Students who take these classes may participate in performances and in competitions against other schools.
In addition to the attribute definitions, reflective questions, practice choice questions, and
actual choice questions, the survey also included a series of background questions. These
included questions on the respondents’ demographic background, their school-age children, and
the degree of their involvement with their child’s education. We also included self-explicated
importance questions, in which respondents rated the importance of each of the 8 attributes
included in the choice questions.
For all responses, we used an efficient fixed choice design with level balance and good
orthogonal properties that was built using SAS modules (Kuhfeld and Wurst 2012).
ANALYSIS
The survey was administered online to a U.S. nationwide sample of 147 SSI panelists.
Respondents were prescreened to include only parents who expect to send a child to a public
middle or high school (grades 6 through 12). The sample was 57% female, 59% white, with 59%
having had some college and 50% having an income of under $50,000 per year.
In order to differentiate between choice preferences for the most vs. least liked school we
conducted independent HB analyses via Sawtooth Software of the eight Best and of the eight
Worst responses. These results, shown in Figure 2, demonstrate substantial differences between
these two tasks. In particular, we see that the least liked school puts more emphasis on avoiding
the worst levels of each of the 4-level attributes. Put differently, the worst judgments do not
differentiate as much between the best and second-best levels of the attributes but strongly
differentiate between the worst and the second worst levels. Such a pattern is consistent with
respondents in the Worst task avoiding the extreme negative features following a relatively non-compensatory process.
There also is a difference in the mean importance among the attributes as measured by the
average of the individual utility differences. In both conditions the most important attribute is
clearly academic quality. However, travel time is in second position for the Best choices but both
percent economically disadvantaged and percent minority are more important for the Worst
choices. Thus, when choosing the most liked school travel time is more important than racial
considerations, whereas the reverse happens when trying to avoid the worst school.
Figure 2
Part-worth Values for Most Liked Schools (solid squares)
and Least Liked Schools (dotted circles)
Figure 2 characterizes the part-worth utilities averaged across respondents. However, it does
not represent preference heterogeneity across parents. To better understand this heterogeneity, we
computed normalized individual level importance scores for each of the attributes in each of the
Best and Worst tasks and submitted them to a principal components analysis. Normalized
importance scores were calculated for each respondent as the difference between the most and
least preferred levels within an attribute, represented as a percentage of the sum of these
differences across all eight attributes. Figure 3 provides a two-dimensional solution to the
principal components analysis in which positively correlated attributes are grouped together. The
normalized importances from the Best choices are represented as squares while the Worst are
represented as circles. These points reflect the factor loadings of the two-dimensional factor
solution.
Figure 3
Principal components representation in two dimensions of eight attribute importance
measures from the Best (squares) and the Worst (circles) judgments
The vertical dimension is anchored at the top by academic quality as defined by the percent
of students below grade level. Notice that both the Best and Worst measures load about equally
on that dimension. The attributes at the bottom of the vertical dimension are Sports and Arts,
reflecting their generally negative correlations (around -.4) with academic quality. The map thus
suggests that those who value sports are less likely to care about the percent of people in the
class that test below grade level, and those who place a high value on student test scores place
less importance on sports and arts.
The degree of fit for the loadings in this space is linearly related to their distance from the
origin. Using that criterion, the results from the Best and Worst choices both span the entire
space, although Best fits slightly better than Worst.
Figure 4 projects vectors representing respondent characteristics onto the orthogonal factor
scores to reflect the correlations between respondent demographics and preference patterns.
These respondent characteristic vectors show that better educated parents who are employed full
time are more likely to value schools with high academic quality, while those with part time
employment and less education are more likely to prefer schools with strong sports or arts
programs.
Figure 4
Projections of respondent characteristic vectors into the two-dimension
principal components space
The horizontal dimension contrasts those who care about what is taught in the school with
those who are concerned with who is in the school. To the right are those for whom the content
areas of IB and STEM are important, populated largely by minority parents and those with older
children. To the left are parents concerned with the number of students from economically
disadvantaged family backgrounds and with the percent minority representation at the school.
This latter group of parents tends to be white, have younger children, and have more income.
There are two simplifications in the choice map given in Figures 3 and 4. First, we are
summarizing eight attributes into a two-dimensional space. While the space accounts for 34% of
the variation, much variation is still left out. For example, this map could be expanded to include
a three dimensional space that focuses on bus travel time, yielding additional insights into the
interplay between preferences and demographic characteristics. The second simplification is that
the map is generated from importance scores, which replace the information on the 4-level
attributes with one importance measure reflecting the range of those part-worths. Little
information is lost for monotone attributes like percent of students under grade level and bus
travel time. However, for more complex attributes like percent minority or percent economically
disadvantaged, around 25% of respondents found an interior level to be superior to both extreme
levels. For these respondents, the correlation map shown in Figure 3 fails to capture the implied
“optimal” point for these attributes.
Finally, we note that the analysis on Figure 3 is generated from the separate Best and Worst
choices. A very similar result emerges with the combined values as well, except that the
combined values tend to have slightly higher loadings and marginally greater correlations with
parent characteristics.
LINEAR PROBABILITY INDIVIDUAL CHOICE MODEL
In Dahan’s (2012) study he estimates individual choice models for each person in real time,
rather than batch processing the entire set of results using Hierarchical Bayes. We use a similar
simple linear probability model to produce a different estimate of individual level preferences.
The linear model entails three heroic assumptions. First, the probability of choice is treated as a
linear dependent variable even though it should be represented with a logit or a probit function.
Second, part-worth functions for the four-level attributes are linearized, thus ignoring any
curvature. Finally, coefficients of the attributes for the “worst” choices are assumed to be the
negative of the coefficients for the “best” choices. These assumptions are sufficiently troubling
that one would not build such a model except, as Dahan suggests, where there is a need to give
respondents real-time feedback on the linear importance of the choices made. We will show that
in spite of its questionable assumptions, the linear probability choice model shows a surprising
correspondence to the appropriate Hierarchical Bayes results.
The process of building the linear probability model is simple and follows from the general
linear model for our study with 8 attributes and 8 Best-Worst choices from sets of four profiles:

- Define choice vector Y with 32 items, 4 for each choice set, coding Best as +1, Worst as -1, and zero otherwise.
- Zero-center the design matrix, X (32 x 8), within each choice set.
- Estimate β = (X'X)^{-1}X'Y, multiplying (X'X)^{-1}X' (8 x 32) by Y (32 x 1) to get β (8 x 1).
The resulting β vector reflects the linear impact of a unit change in each of the attributes on the probability that a person will choose the item. This calculation is sufficiently simple that it could be programmed into a computer-based survey and multiplied by the respondent's choice
vector Y to produce linear importance estimates. This is particularly true when, as in our case,
the design is fixed, so (X’X)-1X’ need be computed only once and can be done in advance. With
a random design, it might be necessary to do this inversion and multiplication on the fly for each
respondent. Our pilot survey did not implement this immediate calculation step; however it
would be possible to do so.
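As an illustration of how it could be done, a minimal R sketch follows; the object names (X.raw, task, best.rows, worst.rows) are hypothetical and are not from the pilot survey:

    # Sketch: on-the-fly linear probability importances for one respondent.
    # X.raw: 32 x 8 design matrix (8 choice sets of 4 profiles); task: set id per row.
    X <- X.raw
    for (s in unique(task)) {
      X[task == s, ] <- scale(X.raw[task == s, ], center = TRUE, scale = FALSE)
    }
    H <- solve(t(X) %*% X) %*% t(X)       # 8 x 32; with a fixed design, precompute once
    y <- rep(0, 32)                       # respondent's choices: +1 = best, -1 = worst
    y[best.rows]  <-  1                   # indices of the profiles picked as best
    y[worst.rows] <- -1                   # indices of the profiles picked as worst
    beta <- as.vector(H %*% y)            # linear importances that could be shown back to the respondent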
Given the questionable assumptions required to support this simplified model, it is natural to
wonder how well the linear probability results compare with the HB results. Figure 5 shows the
correlation between the combined best-worst HB and the linear probability model. For most
attributes, including academic quality, percent minority, travel time, sports, STEM and arts,
correlation coefficients were relatively high and ranged between .70 and .90. Linear and HB
estimates were less correlated for IB and Economic welfare of the respondent.
Figure 5
Correlation of Importances between HB and Linear Choice Models
We also compared the HB and linear estimates of choices with a direct self-explicated
measure of importance. Our survey included self-explicated importance questions asking how
important each attribute was on a 7-point scale. We zero centered these self-explicated
importance ratings within each respondent and computed correlations with the linear and HB
importance scores. The correlations are relatively poor for both models, which may reflect the
difficulty people generally have with such direct assessments of importance. However, for four
out of eight attributes HB is more correlated with the self-explicated data than is the linear
model; for two attributes the linear model is more correlated, and for two attributes the HB and
linear models are about equally correlated with the self-explicated importances.
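A minimal sketch of this comparison in R, assuming respondent-by-attribute matrices self.exp, hb.imp, and lin.imp (hypothetical names):

    # Sketch: zero-center self-explicated ratings within respondent, then correlate
    # them with the model-based importance scores attribute by attribute.
    se <- t(scale(t(self.exp), center = TRUE, scale = FALSE))
    cor.hb  <- sapply(1:8, function(a) cor(se[, a], hb.imp[, a]))
    cor.lin <- sapply(1:8, function(a) cor(se[, a], lin.imp[, a]))
    round(cbind(HB = cor.hb, Linear = cor.lin), 2)   # the comparison summarized in Figure 6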
There are a number of contexts where the ability to give feedback on the fly would be useful.
Consider a BW conjoint study that asks cardiac patients about their tradeoffs between surgery
options that vary on the extent of the operation, likelihood of success, risk of complications,
recovery time and out-of-pocket costs. Patients could then be given an immediate summary of
what is important to them from the linear probability model, and asked if they would like to
change any of its values. Then the patients would have the option of sending those adjusted
values to the surgeon who would meet with them to help make their surgery decision.
Figure 6
Correlation of HB (dark) and Linear (light) Importance Estimates with
Direct Self-explicated Estimates of Importance
CONCLUSIONS
There are a number of surprising conclusions that are suggested by this pilot study of school
choice. These relate to the contexts in which best-worst conjoint is appropriate, the kind of
analyses that should be used with this data, and the applicability of the results in a real-world
context. Each of these is discussed below.
BW conjoint is most appropriate for choices where respondents may have both strong desires
for, and aversions to, specific attribute levels. Aversion is typically not an issue for choices
among package goods such as breakfast cereal, or short-lived experiences such as weekend
vacations. In those cases where there are many options and few long-term risks, people quickly
focus on trading off what they want rather than focusing on what they do not want. For example,
one would not learn much about person’s cereal purchase by knowing that their least favorite is
Coco-Puffs. In such cases, one might consider including first and second choices, but evidence
indicates that it may not be worth the extra respondent time required (Johnson and Orme 1996).
However, choices with desired positive and unavoidable negative features are ideal for best-worst CBC. These include medical choices which combine positive and negative features,
housing choices in which any decision has clear advantages and disadvantages, and in our case,
school choices. Our results clearly show that the part-worth patterns for the Worst option are
quite different from the Best. Asking respondents to think about their least preferred alternative
reveals what they most want to avoid, and such avoidance behavior may override the otherwise
positive features of an alternative. Thus, the two perspectives provide a chance to better
understand the mechanisms of choice.
We tested and compared several methods for estimating the importance of the attributes to
individual respondents. While the HB models produced a better fit and required fewer
questionable assumptions, the linear model generated a surprisingly high correlation with the
HB model. While not perfect, a linear model may be useful where there is a desire to give
immediate feedback to the respondent. Such feedback could be used as a decision aid for people
in the process of making a high-impact, low frequency decision.
Our analysis demonstrates not only that conjoint methods can be used to elicit preferences for
school choices, but also that best-worst choices provide additional salient insight into what
parents are willing to trade off in choosing schools. Similar conjoint studies could be used in
school district planning or designing school choice policies. For example, the results from this
pilot survey suggest that schools that have difficulty attracting strong academic students might
increase enrollment among some demographic groups by developing strong arts or sports
programs. Alternatively, school district planners might use a school choice simulation within
their target community to generate school choice sets that maximize the probability that children attend a desired school and minimize the likelihood of being assigned to an undesired school. In
sum, best-worst CBC is an important method of dealing with high impact, long-term decisions.
Joel Huber
REFERENCES
Bridges, John F., A. F. Mohamed, et al. (2012). “Patients’ preferences for treatment outcomes for
advanced non-small cell lung cancer: A conjoint analysis.” Lung Cancer 77(1): 224–231.
Dahan, Ely (2012) “Adaptive Best-Worst (ABC) Conjoint Analysis,” 2012 Sawtooth Software
Conference Proceedings, 223–236.
Hastings, Justine S., and Jeffrey M. Weinstein (2008) “Information, school choice, and academic
achievement: Evidence from two experiments.” The Quarterly Journal of Economics 123.4,
1373–1414.
Johnson, Richard M. and Bryan K. Orme (1996) “How Many Questions Should You Ask in
Choice-Based Conjoint Studies,”
www.sawtoothsoftware.com/download/techpap/howmanyq.pdf
Johnson, F. Reed, B. Hauber, et al. (2010). “Are gastroenterologists less tolerant of treatment
risks than patients? Benefit-risk preferences in Crohn’s disease management.” Journal of
Managed Care Pharmacy 16(8): 616–628.
Kuhfeld, Warren F. and John C. Wurst (2012) “An Overview of the Design of Stated Choice
Experiments,” 2012 Sawtooth Software Conference Proceedings, p. 165–194.
Phillips, Kristie JR, Charles Hausman, and Elisabeth S. Larsen (2012) “Students Who Choose
And The Schools They Leave: Examining Participation in Intradistrict Transfers.” The
Sociological Quarterly 53.2, 264–294.
Shu, Suzanne, Robert Zeithammer and John Payne (Working paper, UCLA). “Consumer
Preferences of Annuities: Beyond NPV.”
DOES THE ANALYSIS OF MAXDIFF DATA REQUIRE SEPARATE
SCALING FACTORS?
JACK HORNE
BOB RAYNER
MARKET STRATEGIES INTERNATIONAL
Scale of the error terms around MaxDiff utilities varies between best and worst responses.
Most estimation procedures, however, assume that scale is fixed, leading to potential bias in the
estimated utilities. We investigate to what degree scale actually does vary between response
categories, and whether true utilities may be better recovered by properly specifying scale when
estimating combined best-worst utilities.
BEST-WORST SCALING: TWO RESPONSE TYPES, ONE SET OF UTILITIES
Maximum-difference (MaxDiff) or Best-Worst scaling has been a widely used technique
among market researchers since it was first introduced more than 20 years ago by Jordan
Louviere (Louviere, 1991; Cohen and Orme, 2004). The technique involves repeated choices of
best and worst items in tasks where only small subsets (usually 4 at a time) of the total number of
items are presented. A single vector of utilities is estimated from the choice data, equal in length
to J - 1 where J is the total number of items, using a MNL model. To account for some items
being selected as best and others as worst, response data are coded as 1 for best choices and as -1
for worst choices. The selection of an item as worst in other words leads to a more negative
utility for that item, while the selection of an item as best leads to a more positive utility.
Figure 1. Best and worst utilities, estimated separately (fit lines: y=x and least squares).
This analytic framework assumes that the best and worst choices used to generate the single
set of utilities follow the same distribution. In point of fact though, we often see that best and
worst utilities (estimated separately) are distributed differently. Figure 1 shows this idea for a
data set with 17 attributes. The range of the worst utilities is wider than the range of the best
utilities by a factor of 1.3. This suggests that there is less error around the worst utilities in this
data set, leading to a greater deviation from 0 in the estimated utilities for those responses. The
above-described analytic framework does not take into account this distributional difference
between best and worst responses.
Dyachenko et al. (2013a, 2013b) have found similar patterns in other data sets. They suggest
that best and worst responses may follow different distributions as a result of a sequential effect (respondents are more accurate in later choices) and an elicitation effect (respondents are more certain about what they like least). Regardless of the cause, what effect do these psychological
processes have on estimation of a single set of best-worst utilities?
MATHEMATICS OF CHOICE PROBABILITIES AND ERROR DISTRIBUTIONS
Utilities are typically estimated from best-worst choice data using a logit model. That model is described as

U_ni = V_ni + ε_ni,

where V_ni is the observable part of the utility, often V_ni = x_ni'β, and ε_ni is the unobservable error surrounding it (Train, 2007). Errors (ε) are Gumbel distributed, and have a mode parameter μ and a positive scale parameter λ. As λ increases, the magnitude of estimated utilities decreases, all else being equal.

Delving further, choice probabilities from the logit model were defined by McFadden (1974) as follows, where P_ni is the probability that respondent n will choose item i:

P_ni = exp(V_ni / λ) / Σ_j exp(V_nj / λ).

The typical analytic framework used in estimating MaxDiff utilities assumes that scale is fixed, and λ drops out of the above equation. However, as we and others have seen, λ is not always fixed across best and worst responses (the ranges of these utilities, estimated separately, can vary considerably). In this paper, we investigate, using simulations and actual MaxDiff data, whether failure to account for these different scales affects the estimated utilities.
SIMULATIONS
Seventeen utilities were generated, ranging from -8/3 to +8/3 (equally spaced and common across 10,000 “respondents”). Error around best responses is distributed as Gumbel Type I and error around worst responses is distributed as Gumbel Type II (Orme, 2013). The Gumbel Type I CDF is

F(ε) = exp(-exp(-(ε - μ)/λ)).

Therefore, Gumbel Type I error was added to the above utilities to form distributions of “best” utilities using the following construct:

u_best = β + ε, with ε = μ - λ ln(-ln(r)),

where r is a random uniform variate ranging from 0 to 1. Scale (λ) was uniformly set at 1 to form best utilities.

Gumbel Type II error is the mirror image of Type I, with CDF F(ε) = 1 - exp(-exp((ε - μ)/λ)). Given this relationship, error was subtracted from the generated utilities to form distributions of worst utilities, using the same construct used in forming best utilities. Scale (λ) was varied in forming worst utilities.
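The following R sketch mirrors this setup using the standard inverse-CDF draw for Gumbel errors; the seed, object names, and the particular worst-response scale shown are assumptions:

    # Sketch: simulated best and worst utilities for 10,000 "respondents" and 17 items.
    set.seed(1)
    n.resp  <- 10000
    true.u  <- matrix(seq(-8/3, 8/3, length.out = 17), n.resp, 17, byrow = TRUE)
    scale.b <- 1                                 # scale of best-response error
    scale.w <- 2                                 # scale of worst-response error (varied by condition)
    r1 <- matrix(runif(n.resp * 17), n.resp, 17)
    r2 <- matrix(runif(n.resp * 17), n.resp, 17)
    gumbel <- function(r, s) -s * log(-log(r))   # inverse-CDF Gumbel (Type I) draw, mode 0
    best.u  <- true.u + gumbel(r1, scale.b)      # Type I error added
    worst.u <- true.u - gumbel(r2, scale.w)      # mirrored (Type II) error subtracted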
All of these utilities, two 10,000 “respondents” by 17 items matrices, were then converted to
best-worst choice data using a design consisting of 10 tasks per “individual” and 4 items per task, yielding 100,000 best responses and 100,000 worst responses. There were 8 versions in the design;
“respondents” were randomly assigned to version.
Finally, new best-worst combined utilities were estimated from the choice data by maximizing the log likelihood (LL) in a MNL model, varying the assumed scale parameter for worst responses:

LL = Σ_n Σ_i [ y_ni^best ln P_ni^best + y_ni^worst ln P_ni^worst ],

where y_ni = 1 if “respondent” n chooses item i; else y_ni = 0.
All estimation was at the individual level (HB) using custom R code developed by one of the
authors to account for scale ratios (package: “HBLogitR,” forthcoming on CRAN).
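Since HBLogitR was not yet released at the time of writing, the sketch below shows only a pooled (non-Bayesian) version of the likelihood being maximized; the list structure of the task data is an assumption:

    # Sketch: pooled MNL log-likelihood for best-worst data with a separate
    # scale multiplier on the worst responses (individual-level HB not shown).
    # tasks: list; each element has X (4 x k design), best (row index), worst (row index).
    bw_negLL <- function(par, tasks) {
      k      <- ncol(tasks[[1]]$X)
      beta   <- par[1:k]
      lambda <- exp(par[k + 1])                    # worst-response scale, kept positive
      ll <- 0
      for (tk in tasks) {
        v       <- as.vector(tk$X %*% beta)
        p.best  <- exp(v) / sum(exp(v))            # logit for the best pick
        p.worst <- exp(-v / lambda) / sum(exp(-v / lambda))  # scaled logit for the worst pick
        ll <- ll + log(p.best[tk$best]) + log(p.worst[tk$worst])
      }
      -ll                                          # optim() minimizes
    }
    # fit <- optim(rep(0, ncol(tasks[[1]]$X) + 1), bw_negLL, tasks = tasks, method = "BFGS")

An individual-level HB treatment, as used in the paper, would place a prior over the betas and scale ratios rather than pooling respondents in this way.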
Results of several simulations are shown in Table 1. Utilities are best recovered when the
actual scale ratio among best and worst utilities is the same as that used in estimation (assumed).
When the assumed scale ratio (best/worst) is larger than the actual, estimated utilities are biased,
especially those nearest the lower end of the utility range.
Table 1. Estimated utilities from simulations and absolute deviances from actual. Numbers in the header refer to actual best/worst ratio, and assumed best/worst ratio. Mean absolute deviances are: 1/1 = 0.142; 2/2 = 0.075; 1/2 = 0.588.

Actual    1/1      deviance    2/2      deviance    1/2      deviance
-2.67     -2.99    0.32        -2.86    0.19        -4.42    1.75
-2.33     -2.51    0.18        -2.44    0.10        -3.58    1.25
-2.00     -2.25    0.25        -2.16    0.16        -3.04    1.04
-1.67     -1.82    0.16        -1.74    0.07        -2.30    0.63
-1.33     -1.50    0.16        -1.41    0.07        -1.65    0.32
-1.00     -1.05    0.05        -0.97    0.03        -0.95    0.05
-0.67     -0.70    0.03        -0.67    0.01        -0.46    0.21
-0.33     -0.33    0.00        -0.31    0.02        -0.01    0.32
 0.00     -0.06    0.06        -0.03    0.03         0.21    0.21
 0.33      0.35    0.01         0.35    0.02         0.77    0.44
 0.67      0.82    0.15         0.73    0.07         1.11    0.44
 1.00      1.13    0.13         1.09    0.04         1.53    0.53
 1.33      1.47    0.13         1.37    0.04         1.87    0.54
 1.67      1.81    0.14         1.77    0.11         2.23    0.56
 2.00      2.23    0.23         2.11    0.12         2.58    0.58
 2.33      2.56    0.23         2.46    0.12         2.91    0.58
 2.67      2.84    0.18         2.74    0.08         3.21    0.54
If the assumed best/worst scale ratio is smaller than the actual, a similar bias occurs (not
shown) where there is greater deviance at the upper end of the utility range. A bias is clearly
present when the scale ratio assumed in analysis does not match the scale ratio among the true
utilities.
ACTUAL DATA
Two data sets were used to test whether accounting for scale differences between best and
worst responses removes any bias from estimated utilities. Data set 1 consisted of 17 items and
300 respondents. Each respondent evaluated 10 (quads) best-worst exercises. Data set 2
consisted of 20 items and 918 respondents. Each respondent in this data set evaluated 15 (quads)
best-worst exercises.
The scale parameter around utilities estimated from a logit model is not identified (Ben-Akiva and Lerman, 1985; Swait and Louviere, 1993; Train, 2007). However, given two (or more)
response types (e.g., best and worst) a scale ratio is estimable. There are several ways to estimate
this scale ratio. In this paper, we estimate scale ratios by first estimating best and worst utilities
independent of one another, and then regressing those utilities against one another, through the
origin. The regression coefficient from this equation becomes the estimate of scale ratio.
Another method, suggested by Kevin Lattery (personal communication), is to estimate scale ratios as the ratio of standard deviations of best and worst utilities, again estimated independently (e.g., sd(u_best)/sd(u_worst)). One advantage of using this method is that it has reciprocal properties (i.e., sd(u_worst)/sd(u_best) = 1 / [sd(u_best)/sd(u_worst)]), which is not the case with the beta from a regression equation. Still
another method involves estimating scale ratios via maximum likelihood, along with the betas, in
a MNL model. We detail that method in a forthcoming paper (Rayner and Horne, forthcoming).
All of these methods are capable of estimating scale ratios at the individual level, and all
return similar results in terms of combined best-worst utilities when applied to the data sets used
in this paper. For the sake of consistency alone, the below results all use the regression through
the origin method.
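Both descriptive estimates are simple to compute; in R, with u.best and u.worst denoting the separately estimated utility vectors (hypothetical names), a sketch is:

    # Sketch: two ways to estimate the best/worst scale ratio from separately estimated utilities.
    ratio.reg <- unname(coef(lm(u.best ~ 0 + u.worst)))  # regression through the origin
    ratio.sd  <- sd(u.best) / sd(u.worst)                # SD ratio (has the reciprocal property)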
As in the simulated data, all estimation was at the individual level (HB), applying individual
scale corrections using custom R code developed by one of the authors (package: “HBLogitR,”
forthcoming on CRAN).
Best and worst utilities differed from one another in both data sets when estimated separately
(Figure 2; Data set 1: best = 0.838 x worst, t[H0: best = worst] = 15.1; Data set 2: best = 0.802 x
worst, t[H0: best = worst] = 32.3). In both data sets, worst utilities tended to be distributed across
a wider range than best utilities, indicating less error around the former.
Figure 3 shows combined best-worst utilities from both data sets, estimated with and without
correcting for scale ratios. Employing the scale ratio correction made little difference in how
utilities were estimated (Data set 1: r=0.998, t=1127.2; Data set 2: r=0.995, t=1349.8). Failure to
adjust for scale ratios led to no more (or less) statistical bias than estimating two consecutive HB
runs on the same data, not correcting for scale ratios both times.
Nevertheless, there were still some small differences in the ranges of utilities and in hit rates
as a result of correcting for scale ratios in estimating combined best-worst utilities. Failure to
correct for scale ratios led to wider ranges of estimated utilities in both data sets (Table 2, top
panel). Further, the difference on the low end of the range was about twice as large as the
difference on the high end. Both of these findings are in keeping with what was found earlier in
simulations when best/worst scale ratio assumed in analysis is smaller than the actual—or in this
case, our estimation of the actual ratio.
Figure 2. Best and worst utilities, estimated separately; left panel: data set 1, n=5100
utilities; right panel: data set 2, n=18360 utilities. Fit lines are y=x and least squares (flatter
slope). The slope of the least squares lines relative to the diagonal is indicative of a wider
range among worst utilities.
Figure 3. Combined best-worst utilities, estimated with and without corrections for scale
ratios among response types; left panel: data set 1; right panel: data set 2.
There was also some, potentially systematic, bias in in-sample hit rates as a result of not
correcting for scale ratios. In-sample hit rates in this circumstance were defined as respondents
actually choosing an item in a task that their estimated utilities suggest they would. When
corrected for scale ratios, overall hit rates in both data sets were slightly worse than when not
corrected (Table 2, bottom panel). However, and perhaps more importantly, hit rates among
worst choices improved slightly with correction, and hit rates among best choices worsened. It is
possible that this resulted because the scale ratio correction usually involved an up-weight on
worst choices and a down-weight on best ones. The worst choices in effect became more
important in determining the combined estimated utilities, while the best choices became less
important.
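For concreteness, the hit-rate definition amounts to the following check, sketched here for one respondent and one task with hypothetical object names:

    # Sketch: in-sample hit indicators for one respondent and one task.
    shown     <- c(3, 7, 11, 15)                            # items presented in the task
    hit.best  <- shown[which.max(u[shown])] == chose.best   # did utilities predict the best pick?
    hit.worst <- shown[which.min(u[shown])] == chose.worst  # ...and the worst pick?
    # Overall hit rates are the proportion of tasks, across respondents, where these are TRUE.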
Table 2. Aggregated combined best-worst utilities, estimated with and without corrections for scale ratios among response types (top panel). In-sample hit rates; largest hit rates in each group are in boldface in the original (bottom panel).

Combined utilities:
                                   min       1Q'tile   3Q'tile   max
Data set 1   with correction      -3.29      -1.05     +1.88     +2.45
             without              -3.47      -1.12     +1.98     +2.57
Data set 2   with correction      -1.76      -0.27     +0.47     +1.09
             without              -1.99      -0.30     +0.51     +1.19

In-sample hit rates:
                                   Best tasks      Worst tasks     All tasks
Data set 1   with correction      2770 (92.3%)    2806 (93.5%)    5576 (92.9%)
             without              2797 (93.2%)    2783 (92.8%)    5580 (93.0%)
Data set 2   with correction      11009 (79.9%)   11249 (81.7%)   22258 (80.8%)
             without              11355 (82.5%)   11076 (80.4%)   22431 (81.4%)
Utility ranks showed very little difference whether corrected for scale ratios or not. To the
extent that utilities were ranked differently in the two different estimation methods (which was
rare), the differences were often only one or two rank positions. These differences were similar
again to what we would find if we estimated two consecutive HB runs on the same data under
the same rules.
So, there does appear to be a small bias in estimating utilities without correcting for scale
ratios; this bias further appears to occur in the direction we might expect from simulations. But,
the bias is not large enough to change any business decisions. The extra analysis required does
not seem to be worth the effort if the goal of doing so is to remove bias.
ANOTHER REASON TO ADJUST FOR SCALE RATIOS?
There may however be another reason that justifies the extra analysis: cleaning data of
particularly “noisy” respondents. Scale ratios can of course be estimated on an individual
respondent basis. If a respondent’s best/worst scale ratio approaches zero, or infinity, or is
negative, it is possible that that person has misunderstood the task (e.g., is selecting “next best”
instead of “worst”) or is otherwise providing “noisy” data. This person’s combined best-worst
utilities may be difficult to estimate.
This is apparent from Figure 4. There is a small group of respondents who have worst/best
scale ratios that are near zero and negative. The individual hit rates for these respondents are
particularly small when combined utilities are estimated with a correction for the scale ratio.
Figure 4. Hit rates and individual scale ratios (from data set 2). Those respondents with
individual worst/best scale ratios > 0.4 (n=863) have average hit rates of 82.7%; those with
individual worst/best scale ratios < 0.4 (n=55) have average hit rates of 51.8%.
Figure 5. Combined best-worst utilities, estimated with and without corrections for scale
ratios among response types; left panel: all respondents, n=918; right panel: respondents
with worst/best scale ratios > 0.4 (n=863).
These small hit rates result because the combined utilities for this group of respondents are
virtually inestimable when correcting for best/worst scale ratios. This can be seen in Figure 5.
Utilities for those respondents with small or negative worst/best scale ratios tend to be estimated
around zero when the correction is made, and add little but noise to the overall estimates (Figure
5, left panel). Removing those respondents and re-estimating utilities cleaned things up
considerably (Figure 5, right panel).
The same effect can be seen in the range of estimated utilities. Removing the “noisy”
respondents led to increased utility ranges, as we might expect to be the case (Table 3).
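The cleaning rule itself is a one-line filter; a sketch, with ind.ratio as a hypothetical vector of individual worst/best scale ratios:

    # Sketch: drop respondents whose individual worst/best scale ratio is small or negative.
    keep <- which(ind.ratio > 0.4)   # in data set 2, 863 of 918 respondents pass this screen
    # re-run the scale-corrected estimation using only the retained respondents' choices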
Table 3. Combined best-worst utilities, estimated with scale ratio correction for all respondents in data set 2 (n=918), and for respondents with individual worst/best scale ratios > 0.4 (n=863).

Data set 2                         min       1Q'tile   3Q'tile   max
All respondents                   -1.76      -0.27     +0.47     +1.09
Small scale ratios removed        -2.05      -0.30     +0.54     +1.25
The individual root likelihood (RLH) statistic has been suggested as one criterion for
cleaning “noisy” respondents from data (Orme, 2013). We compared individual worst/best scale
ratios to individual RLH statistics in one of the data sets and found no relationship (r=-0.01,
t[H0: r=0]=-0.185, p=0.853). It seems then, at least from examination of a single data set, that
using best/worst scale ratios provides a measure of respondent “noisiness” that can be used
independently from (and in concert with) RLH.
FINAL THOUGHTS
Best and worst responses from MaxDiff tasks do tend to be scaled differently. Whatever the
reason for this, not taking those differences into account when estimating combined best-worst
utilities appears to add some systematic bias, albeit a small amount, to those utilities. The
presence of this bias raises the question of whether or not to adjust for the scale differences
during analysis.
Adjustment requires additional effort, and that effort may not be seen as justified given the
lack of strong findings here and elsewhere. Practitioners, rightly or not, may feel that if there is
only a little bias and that bias will not change managerial decisions about the data, why not
continue with the status quo? It is a reasonable opinion to hold.
But, fundamentally, this bias results from having two kinds of responses—best and worst—
combined into a single analysis. Perhaps the better solution is to tailor the response used to the
desired objective. If the objective is to identify “best” alternatives, ask for only best responses;
likewise, if the objective is to identify “worst” alternatives, ask for only worst responses. Doing
so might better focus respondents’ attention on the business objective; we could include more
choice tasks since we would be asking for one fewer response in each task; and we would not
have to worry about the effect of different scales between response types.
The difficulty with this approach for practitioners would be in identifying, managerially,
what the objective should be (finding “bests” or eliminating “worsts”) in an age where we’ve
become used to doing both in the same exercise. This idea remains the focus of future research.
Jack Horne
REFERENCES
Ben-Akiva, M. and Lerman, S. R. (1985). Discrete Choice Analysis: Theory and Application to
Travel Demand. MIT Press, Cambridge, MA.
Cohen S. and Orme, B. (2004). What’s your preference? Marketing Research, 16, pp. 32–37.
Dyachenko, T. L., Naylor, R. W. and Allenby, G. M. (2013a). Models of sequential evaluation in
best-worst choice tasks. Advanced Research Techniques (ART) Forum. Chicago.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2072496
Dyachenko, T. L., Naylor, R. W., and Allenby, G. M. (2013b). Ballad of the best and worst. 2013
Sawtooth Software Conference Proceedings. Dana Point, CA.
Louviere, J. J. (1991). Best-worst scaling: A model for the largest difference judgments. Working
Paper. University of Alberta.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior, in P. Zarembka
(ed.), Frontiers in Econometrics. Academic Press. New York. pp. 105–142.
Orme, B. (2013). MaxDiff/Web v.8 Technical Paper.
http://www.sawtoothsoftware.com/education/techpap.shtml
Rayner, B. K. and Horne, J. (forthcoming). Scaled MaxDiff. Marketing Letters (submitted).
Swait, J. and Louviere, J. J. (1993). The role of the scale parameter in estimation and comparison
of multinomial logit models. Journal of Marketing Research, 30, pp. 305–314.
Train, K. E. (2007). Discrete Choice Methods with Simulation. Cambridge University Press. New
York. pp. 38–79.
USING CONJOINT ANALYSIS TO DETERMINE THE MARKET VALUE OF
PRODUCT FEATURES
GREG ALLENBY
OHIO STATE UNIVERSITY
JEFF BRAZELL
THE MODELLERS
JOHN HOWELL
PENN STATE UNIVERSITY
PETER ROSSI
UNIVERSITY OF CALIFORNIA LOS ANGELES
ABSTRACT
In this paper we propose an approach for using conjoint analysis to attach economic value to specific product features, a task that arises frequently in econometric applications and in intellectual property litigation. A common approach to this task involves taking the difference in utility levels and dividing by the price coefficient. This is fraught with difficulties, including a) certain respondents being projected to pay astronomically high amounts for features, and b) the approach ignoring important competitive realities in the marketplace. In this paper we argue that assessing the economic value of a feature to a firm requires conducting market simulations (a share of
preference analysis) involving a realistic set of competitors, including the outside good (the
“None” category). Furthermore, it requires a game theoretic approach to compare the industry
equilibrium prices with and without the focal product feature.
1. INTRODUCTION
Valuation of product features is a critical part of the development and marketing of products
and services. Firms are continuously involved in the improvement of existing products by adding
new features and many “new products” are essentially old products which have been enhanced
with features previously unavailable. For example, consider the smartphone category of
products. As new generations of smartphones are produced and marketed, existing features such
as screen resolution/size or cellular network speed are enhanced to new higher levels. In
addition, features are added to enhance the usability of the smartphone. These new features might
include integration of social networking functions into the camera application of the smartphone.
A classic example, which was involved in litigation between Apple and Samsung, is the use of
icons with rounded edges. New and enhanced features often involve substantial development
costs and sometimes also require new components which drive up the marginal cost of
production.
The decision to develop new features is a strategic decision involving not only the cost of
adding the feature but also the possible competitive response. The development and marketing
costs of feature enhancement must be weighed against the expected increase in profits which will
accrue if the product feature is added or enhanced. Expected profits in a world with the new
feature must be compared to expected profits in a world without the feature. Computing this
change in expected profits involves predicting not only demand for the feature but also assessing
341
the new industry equilibrium that will prevail with a new set of products and competitive
offerings.
In a litigation context, product features are often at the core of patent disputes. In this paper,
we will not consider the legal questions of whether or not the patent is valid and whether or not
the defendant has infringed the patent(s) in dispute. We will focus on the economic value of the
features enabled by the patent. The market value of the patent is determined both by the value of
the features enabled as well as by the probability that the patent will be deemed to be a valid
patent and the costs of defending the patent’s validity and enforcement. The practical content of
both apparatus and method patents can be viewed as the enabling of product features. The
potential value of the product feature(s) enabled by patent is what gives the patent value. That is,
patents are valuable only to the extent that they enable product features not obtainable via other
(so-called “non-infringing”) means.
In both commercial and litigation realms, therefore, valuation of product features is critical to
decision making and damages analysis. Conjoint Analysis (see, for example, Orme 2009 and
Gustafsson et al. 2000) is designed to measure and simulate demand in situations where products
can be assumed to be comprised of bundles of features. While conjoint analysis has been used for
many years in product design (see the classic example in (Green and Wind, 1989)), the use of
conjoint in patent litigation has only developed recently. Both uses of conjoint stem from the
need to predict demand in the future (after the new product has been released) or in a
counterfactual world in which the accused infringing products are withdrawn from the market.
However, the literature has struggled, thus far, to precisely define the meaning of “value” as applied
to product features. The current practice is to compute what many authors call a Willingness to
Pay (hereafter, WTP) or a Willingness To Buy (hereafter, WTB). WTP for a product feature
enhancement is defined as the monetary amount which would be sufficient to compensate a
consumer for the loss of the product feature or for a reduction to the non-enhanced state. WTB is
defined as the change in sales or market share that would occur as the feature is added or
enhanced. The problem with both the WTP and WTB measures is that they are not equilibrium
outcomes. WTP measures only a shift in the demand curve and not what the change in
equilibrium price will be as the feature is added or enhanced. WTB holds prices fixed and does
not account for the fact that as a product becomes more valuable equilibrium prices will typically
go up.
We advocate using equilibrium outcomes (both price and shares) to determine the
incremental economic profits that would accrue to a firm as a product is enhanced. In general,
the WTP measure will overstate the change in equilibrium price and profits and the WTB
measure will overstate the change in equilibrium market share. We illustrate this using a conjoint
survey for digital cameras and the addition of a swivel screen display as the object of the
valuation exercise. Standard WTP measures are shown to greatly overstate the value of the
product feature. To compute equilibrium outcomes, we will have to make assumptions about cost
and the nature of competition and the set of competitive offers. Conjoint studies will have to be
designed with this in mind. In particular, greater care must be exercised to include an appropriate set of competitive brands, to handle the outside option appropriately, and to estimate price sensitivity precisely.
2. PSEUDO-WTP, TRUE WTP AND WTB
In the context of conjoint studies, feature valuation is achieved by using various measures
that relate only to the demand for the products and features and not to the supply. In particular, it
is common to produce estimates of what some call Willingness To Pay and Willingness To Buy.
Both WTP and WTB depend only on the parameters of the demand system. As such, the WTP
and WTB measures cannot be measures of the market value of a product feature, as they do not
directly relate to what incremental profits a firm can earn on the basis of the product feature. In
this section, we review the WTP and WTB measures and explain the likely biases in these
measures in feature valuation. We also explain why the WTP measures used in practice are not
true WTP measures and provide the correct definition of WTP.
2.1 The Standard Choice Model for Differentiated Product Demand
Valuation of product features depends on a model for product demand. In most marketing
and litigation contexts, a model of demand for differentiated products is appropriate. We briefly
review the standard choice model for differentiated product demand. In many contexts, any one
customer purchases at most one unit of the product. While it is straightforward to extend our
framework to consider products with variable quantity purchases, we limit attention to the unit demand situation and begin by developing the model for a single respondent. Extensions needed for multiple respondents are straightforward (see Rossi et al., 2005).
The demand system then becomes a choice problem in which customers have J choice
alternatives, each with characteristics vector, xj, and price, pj. The standard random utility model
(McFadden, 1981) postulates that the utility for the jth alternative consists of a deterministic
portion (driven by x and p) and an unobservable portion which is modeled, for convenience, as a
Type I extreme value distribution:

U_j = x_j'β + β_p p_j + ε_j,

where x_j is a k x 1 vector of attributes of the product, including the feature that requires valuation, and x_f denotes the focal feature.
Feature enhancement is modeled as alternative levels of the focal feature, xf (one element of
the vector x), while the addition of features would simply have xf as a dummy or indicator
variable. There are three assumptions regarding the model above that are important for
feature valuation: 1. This is a compensatory model with a linear utility. 2. We enter price linearly
into the model instead of using the more common dummy variable coding used in the conjoint
literature. That is, if price takes on K values, p1, …, pK, we include one price coefficient instead of
the usual K-1 dummy variables to represent the different levels. In equilibrium calculations, we
will want to consider prices at any value in some relevant range in order to use first order
conditions which assume a continuum. 3. There is a random utility error that theoretically can
take on any number on the real line.
The random utility error, εj, represents the unobservable (to the investigator) part of utility.
This means that actual utility received from any given choice alternative depends not only on the
observed product attributes, x, and price but also on realizations from the error distribution. In
the standard random utility model, there is the possibility of receiving up to infinite utility from
the choice alternative. This means that in evaluating the option to make choices from a set of
products, we must consider the contribution not only of the observed or deterministic portion of
utility but also the distribution of the utility errors. The possibilities for realization from the error
distribution provide a source of utility for each choice alternative.
In the conjoint literature, the β coefficients are called part-worths. It should be noted that the
part-worths are expressed in a utility scale which has an arbitrary origin (as defined by the base
alternative) and an equally arbitrary scaling (somewhat like the temperature scale). This means
that we cannot compare elements of the β vector in ratio terms or utilizing percentages. In
addition, if different consumers have different utility functions (which is almost a truism of
marketing) then we cannot compare part-worths across individuals. For example, suppose that
one respondent gets twice as much utility from feature A as feature B, while another respondent
gets three times as much utility from feature B as A. All we can say is that the first respondent
ranks A over B and the second ranks B over A; no statements can be made regarding the relative
“liking” of the various features.
2.2 Pseudo WTP
The arbitrary scaling of the logit choice parameters presents a challenge to interpretation. For
this reason, there has been a lot of interest in various ways to convert part-worths into quantities
such as market share or dollars which are defined on ratio scales. What is called “WTP” in the
conjoint literature is one attempt to convert the part-worth of the focal feature, βf, to the dollar
scale. Using a standard dummy variable coding, we can view the part-worth of the feature as
representing the increase in deterministic utility that occurs when the feature is turned on. For
feature enhancement, a dummy coding approach would require that we use the difference in partworths associated with the enhancement in the “WTP” calculation. If the feature part-worth is
divided by the price coefficient, then we have converted to the ratio dollar scale. We will call this
“pseudo-WTP” as it is not a true WTP measure as we explain below.
This p-WTP measure is often justified by appeal to the simple argument that this is the
amount by which price could be raised and still leave the “utility” for choice alternative J the
same when the product feature is turned on. Others define this as a “willingness to accept” by
giving the completely symmetric definition as the amount by which price would have to be
lowered to yield the same utility in a product with the feature turned off as with a product with
the feature turned on. Given the assumption of a linear utility model and a linear price term, both
definitions are identical. In practice, reference price effects often make WTA differ from WTP,
(see Viscusi and Huber, 2011) but, in the standard economic model, these are equivalent. In the literature (Orme, 2001), p-WTP is sometimes defined as the amount by which the price of
the feature-enhanced product can be increased and still leave its market share unchanged. In a
homogeneous logit model, this is identical to the expression above.
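For a single homogeneous-logit respondent, the pseudo-WTP calculation described above reduces to one line; the R sketch below uses hypothetical names for the dummy-coded feature part-worths and the price coefficient:

    # Sketch: pseudo-WTP as the feature part-worth difference divided by the price slope.
    p.wtp <- (b.feature.on - b.feature.off) / abs(b.price)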
Inspection of the p-WTP formula reveals at least two reasons why the p-WTP formula cannot be
true WTP. First, the change in WTP should depend on which product is being augmented with
the feature. The conventional p-WTP formula is independent of which product variant is being
augmented due to the additivity of the deterministic portion of the utility function. Second, true
WTP must be derived ex ante—before a product is chosen. That is, adding the feature to one of
the J products in the marketplace enhances the possibilities for attaining high utility. Removing
the feature reduces levels of utility by diminishing the opportunities in the choice set. This is all
related to the assumption that on each choice occasion a separate set of choice errors are drawn.
Thus, the actual realization of the random utility errors is not known prior to the choice, and
must be factored into the calculations to estimate the true WTP.
2.3 True WTP
WTP is an economic measure of social welfare derived from the principle of compensating
variation. That is, WTP for a product is the amount of income that will compensate for the loss of
utility obtained from the product; in other words, a consumer should be indifferent between
having the product or not having the product with an additional income equal to the WTP.
Indifference means the same level of utility. For choice sets, we must consider the amount of
income (called the compensating variation) that I must pay a consumer faced with a diminished
choice set (either an alternative is missing or diminished by omission of a feature) so that
the consumer attains the same level of utility as a consumer facing a better choice set (with the
alternative restored or with the feature added). Consumers evaluate choices a priori or before
choices are made. Features are valuable to the extent to which they enhance the attainable utility
of choice. Consumers do not know the realization of the random utility errors until they are
confronted with a choice task. Addition of the feature shifts the deterministic portion of utility or
the mean of the random utility. Variation around the mean due to the random utility errors is
equally important as a source of value.
The random utility model was designed for application to revealed preference or actual
choice in the marketplace. The random errors are thought to represent information unobservable
to the researcher. This unobservable information could be omitted characteristics that make
particular alternatives more attractive than others. In a time series context, the omitted variables
could be inventory which affects the marginal utility of consumption. In a conjoint survey
exercise, respondents are explicitly asked to make choices solely on the basis of attributes and
levels presented and to assume that all other omitted characteristics are to be assumed to be the
same. It might be argued, then, that the role of random utility errors is different in the conjoint
context. Random utility errors might be more the result of measurement error rather than omitted
variables that influence the marginal utility of each alternative.
However, even in conjoint setting, we believe it is still possible to interpret the random utility
errors as representing a source of unobservable utility. For example, conjoint studies often
include brand names as attributes. In these situations, respondents may infer that other
characteristics correlated with the brand name are present even though the survey instructions
tell them not to make these attributions. One can also interpret the random utility errors as arising
from functional form mis-specification. That is, we know that the assumption of a linear utility
model (no curvature and no interactions between attributes) is a simplification at best. We can
also take the point of view that a consumer is evaluating a choice set prior to the realization of
the random utility errors that occur during the purchase period. For example, I consider the value
of choice in the smartphone category at some point prior to a purchase decision. At the point, I
know the distribution of random utility errors that will depend on features I have not yet
discovered or from demand for features which is not yet realized (i.e., I will realize that I will get
345
a great deal of benefit from a better browser). When I go to purchase a smartphone, I will know
the realization of these random utility errors.
To evaluate the utility afforded by a choice set, we must consider the distribution of the
maximum utility obtained across all choice alternatives. This maximum has a distribution
because of the random utility errors. For example, suppose we add the feature to a product
configuration that is far from utility maximizing. It may still be that, even with the feature, the
maximum deterministic utility is provided by a choice alternative without the feature. This does
not mean that the feature has no value simply because the product to which it is added is dominated
by other alternatives in terms of deterministic utility. The alternative with the feature added can
still be chosen once the random utility errors are realized, if the realization for that enhanced
alternative happens to be very high.
The evaluation of true WTP involves the change in the expected maximum utility for a set of
offerings with and without the enhanced product feature. We refer the reader to the more
technical paper by Allenby et al. (2013) for its derivation, and simply show the formula below to
illustrate its difference from the p-WTP formula described above:
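As a sketch of its form (the exact derivation is in the paper cited above): under extreme value (logit) errors the expected maximum utility takes the familiar log-sum form, so the true WTP can be written in terms of the deterministic utilities V_j of the J alternatives as

\[
\mathrm{WTP}(a \rightarrow a^{*}) \;=\; \frac{1}{|\beta_{price}|}\Big[\mathrm{E}\big(\max_{j} U_{j}(a^{*})\big) - \mathrm{E}\big(\max_{j} U_{j}(a)\big)\Big]
\;=\; \frac{1}{|\beta_{price}|}\Big[\ln \sum_{j=1}^{J} e^{V_{j}(a^{*})} - \ln \sum_{j=1}^{J} e^{V_{j}(a)}\Big]
\]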
where a* is the enhanced level of the attribute. In this formulation, the value of an enhanced
level of an attribute is greater when the choice alternative has higher initial value.
2.4 WTB
In some analyses, product features are valued using a “Willingness To Buy” concept. WTB is
the change in market share that will occur if the feature is added to a specific product.
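In symbols (a sketch of this definition, holding the price vector p and the attribute configuration of the other products fixed):

\[
\mathrm{WTB}_{j}(a \rightarrow a^{*}) \;=\; MS_{j}(p, a^{*}) - MS_{j}(p, a)
\]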
where MS(j) is the market share equation for product j. The market share depends on the
entire price vector and the configuration of the choice set. This equation holds prices fixed as the
feature is enhanced or added. The market share equations are obtained by summing up the logit
probabilities over possibly heterogeneous (in terms of taste parameters) customers. The WTB
measure does depend on which product the feature is added to (even in a world with identical or
homogeneous customers) and thereby remedies one of the defects of the pseudo-WTP measure.
However, WTB assumes that firms will not alter prices in response to a change in the set of
products in the marketplace as the feature is added or enhanced. In most competitive situations,
if a firm enhances its product and the other competing products remain unchanged, we would
expect the focal firm to be able to command a somewhat higher price, while demand for the other
firms’ offerings would decline; the competing firms would therefore reduce their prices or add
other features.
2.5 Why p-WTP, WTP and WTB are Inadequate
Pseudo-WTP, WTP and WTB do not take into account equilibrium adjustments in the market
as one of the products is enhanced by addition of a feature. For this reason, we can view neither
pseudo-WTP nor WTP as what a firm can charge for a feature-enhanced product, nor can we view
WTB as the market share that can be gained by feature enhancement. Computation of
changes in the market equilibrium due to feature enhancement of one product will be required to
develop a measure of the economic value of the feature. WTP will overstate the price premium
afforded by feature enhancement and WTB will also overstate the impact of feature enhancement
on market share. Equilibrium computations in differentiated product cases are difficult to
illustrate by simple graphical means. In this section, we will use the standard demand and supply
graphs to provide an informal intuition as to why p-WTP and WTB will tend to overstate the
benefits of feature enhancement.
Figure 1 shows a standard industry supply and demand set-up. The demand curve is
represented by the blue downward sloping lines. “D” denotes demand without the feature and
“D*” denotes demand with the feature. The vertical difference between the two demand curves is
the change in WTP as the feature is added. We assume that addition of the feature may increase
the marginal cost of production (note: for some features such as those created purely via
software, the marginal cost will not change). It is easy to see that, in this case, the change in WTP
exceeds the change in equilibrium price. A similar argument can be made to illustrate that WTB
will exceed the actual change in demand in a competitive market.
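To make this intuition concrete, here is a minimal numeric sketch with hypothetical linear demand and supply curves (not the heterogeneous logit system used later in the paper): a vertical demand shift equal to the WTP increase raises the equilibrium price by only a fraction of that shift.

import numpy as np  # not strictly needed here, but kept for consistency with later sketches

# Hypothetical linear demand/supply example showing that a WTP shift overstates the
# equilibrium price change. All numbers are illustrative only.
wtp_shift = 20.0        # vertical shift of the demand curve from adding the feature ($)
b, d = 2.0, 3.0         # demand slope and supply slope (units per $)
a, c = 400.0, 40.0      # demand and supply intercepts (units)

def equilibrium_price(demand_intercept, supply_intercept):
    # D(p) = demand_intercept - b*p and S(p) = supply_intercept + d*p; solve D = S for p.
    return (demand_intercept - supply_intercept) / (b + d)

p0 = equilibrium_price(a, c)                      # equilibrium price without the feature
p1 = equilibrium_price(a + b * wtp_shift, c)      # demand shifted up vertically by wtp_shift
print(f"WTP shift: ${wtp_shift:.2f}, equilibrium price change: ${p1 - p0:.2f}")
# The price rises by wtp_shift * b / (b + d) = $8.00, less than the $20.00 WTP shift.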
Figure 1. Difficulties with WTP
3. ECONOMIC VALUATION OF FEATURES
The goal of feature enhancement is to improve the profitability of the firm introducing the
feature-enhanced product into an existing market. Similarly, the value of a patent is ultimately
derived from the profits that accrue to firms who practice the patent by developing products that
utilize the patented technology. In fact, the standard economic argument for allowing patent
holders to sell their patents is that, in this way, patents will eventually find their way into the
hands of those firms who can best utilize the technology to maximize demand and profits. For
these reasons, we believe that the appropriate measure of the economic value of feature
enhancement is the incremental profits that the feature enhancement will generate.
Profit, π, is associated with the industry equilibrium prices and shares given a particular set
of competing products, represented by the choice set defined by the attribute matrix.
This definition allows for both price and share adjustment as a result of feature enhancement,
removing some of the objections to the p-WTP, WTP and WTB concepts. Incremental profit is
closer in spirit, though not identical, to the definition of true WTP in the sense that profits
depend on the entire choice set and the incremental profit may depend on which product is
subject to feature enhancement. However, social WTP does not include cost considerations and
does not address how the social surplus is divided between the firm and the customers.
In the abstract, our definition of economic value of feature enhancement seems to be the
appropriate measure for the firm that seeks to enhance a feature. All funds have an opportunity
cost and the incremental profits calculation is fundamental to deploying product development
resources optimally. In fairness, industry practitioners of conjoint analysis also appreciate some
of the benefits of an incremental profits orientation. Often marketing research firms construct
“market simulators” that simulate market shares given a specific set of products in the market.
Some even go so far as to attempt to compute the “optimal” price by simulating market shares
corresponding to different “pricing scenarios.” In these exercises, practitioners fix
competing prices at a set of prices that may include their informal estimate of competitor
response. This is not the same as computing a market equilibrium, but it moves in that direction.
3.1 Assumptions
Once the principle of incremental profits is adopted, the problem becomes defining the
nature of competition and the competitive set, and choosing an equilibrium concept. These
assumptions must be added to the assumptions of a specific parametric demand system (we will
use a heterogeneous logit demand system which is flexible but still parametric) as well as a linear
utility function over attributes and the assumption (implicit in all conjoint analysis) that products
can be well described by bundles of attributes. Added to these assumptions, our valuation method
will also require cost information. Specifically, we will assume
1. Demand Specification: A standard heterogeneous logit demand that is linear in the
attributes (including price).
2. Cost Specification: Constant marginal cost.
3. Single product firms.
4. Feature Exclusivity: The feature can only be added to one product.
5. No Exit: Firms cannot exit or enter the market after product enhancement takes place.
6. Static Nash Price Competition: There is a set of prices such that each individual firm
would be worse off if it unilaterally deviated from the equilibrium.
Assumptions 2, 3, 4 can be easily relaxed. Assumption 1 can be replaced by any valid
demand system. Assumptions 5 and 6 cannot be relaxed without imparting considerable
complexity to the equilibrium computations.
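To give a sense of what such a computation involves (a generic sketch under the assumptions above, not the authors' code), the following computes static Nash equilibrium prices for single-product firms facing heterogeneous logit demand with constant marginal costs by iterating on the first-order conditions. The attribute matrix, part-worth draws, and costs are hypothetical placeholders.

import numpy as np

def logit_shares(prices, X, betas, beta_price):
    """Market shares under heterogeneous logit demand with an outside option.

    X: (J, K) attribute matrix, betas: (N, K) respondent part-worth draws,
    beta_price: (N,) negative price coefficients. Shares are averaged choice probabilities.
    """
    V = betas @ X.T + np.outer(beta_price, prices)           # (N, J) deterministic utilities
    expV = np.exp(V)
    probs = expV / (1.0 + expV.sum(axis=1, keepdims=True))   # outside option has utility 0
    return probs.mean(axis=0)

def nash_prices(X, betas, beta_price, costs, tol=1e-8, max_iter=1000):
    """Fixed-point iteration on the single-product-firm first-order conditions:
    p_j = c_j + s_j / (-ds_j/dp_j), where the own-price derivative for logit is
    beta_price * prob * (1 - prob) per respondent, averaged over respondents."""
    prices = costs + 1.0
    for _ in range(max_iter):
        V = betas @ X.T + np.outer(beta_price, prices)
        expV = np.exp(V)
        probs = expV / (1.0 + expV.sum(axis=1, keepdims=True))
        shares = probs.mean(axis=0)
        d_share_dp = (beta_price[:, None] * probs * (1.0 - probs)).mean(axis=0)
        new_prices = costs + shares / (-d_share_dp)          # markup equation
        if np.max(np.abs(new_prices - prices)) < tol:
            break
        prices = new_prices
    return prices

# Hypothetical example: three single-product firms, two non-price attributes.
rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])           # product attributes
betas = rng.normal(1.0, 0.5, size=(500, 2))                  # respondent part-worths
beta_price = -np.exp(rng.normal(-1.0, 0.3, size=500))        # strictly negative coefficients
costs = np.array([2.0, 2.0, 2.5])                            # constant marginal costs
p_eq = nash_prices(X, betas, beta_price, costs)
profits = (p_eq - costs) * logit_shares(p_eq, X, betas, beta_price)   # per-capita equilibrium profits

Re-running such a computation with and without the enhanced feature in the attribute matrix gives the change in equilibrium profits that our definition of economic value requires.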
4. USING CONJOINT ANALYSIS FOR EQUILIBRIUM CALCULATIONS
Economic valuation of feature enhancement requires a valid and realistic demand system as
well as cost information and assumptions about the set of competitive products. If conjoint
studies are to be used to calibrate the demand system, then particular care must be taken to
design a realistic conjoint exercise. The low cost of fielding and analyzing a conjoint design
makes this method particularly appealing in a litigation context. In addition, with Internet panels,
conjoint studies can be fielded and analyzed in a matter of days, a time frame also attractive in
the tight schedules of patent litigation. However, there is no substitute for careful conjoint
design. Many designs fielded today are not useful for economic valuation of feature
enhancement. For example, recent litigation has relied on conjoint studies with no outside
option, only one brand, and only patented features. A study with any of these
limitations is of questionable value for true economic valuation.
Careful practitioners of conjoint have long been aware that conjoint is appealing because of
its simplicity and low cost but that careful studies make all the difference between realistic
predictions of demand and useless results. We will not repeat the many prescriptions for careful
survey analysis, which include thoroughly crafting questionnaires with terminology that is
meaningful to respondents, thorough and documented pre-testing, and representative (projectable)
samples. Furthermore, many of the prescriptions for conjoint design, including well-specified and
meaningful attributes and levels, are extremely important. Instead, we will focus on the areas we
feel are especially important for economic valuation and are not considered carefully enough.
4.1 Set of Competing Products
The guiding principle in conjoint design for economic valuation of feature enhancement is
that the conjoint survey must closely approximate the marketplace confronting consumers. In
industry applications, the feature enhancement has typically not yet been introduced into the
marketplace (hence the appeal of a conjoint study), while in patent litigation the survey is being
used to approximate demand conditions at some point in the past in which patent infringement is
alleged to have occurred.
Most practitioners of conjoint are aware that, for realistic market simulations, the major
competing products must be used. This means that the product attributes in the study should
include not only functional attributes such as screen size, memory, etc., but also the major
brands. This point is articulated well in Orme (2001). However, in many litigation contexts, the
view is that only the products and brands accused of patent infringement should be included in
the study. The idea is that only a certain brand’s products are accused of infringement and,
therefore, that the only relevant feature enhancement for the purposes of computing patent
damages is feature enhancement in the accused products.
For example, in recent litigation, Samsung has accused Apple iOS devices of infringing
certain patents owned by Samsung. The view of the litigators is that a certain feature (for
example, a certain type of video capture and transmission) infringes a Samsung patent.
Therefore, the only relevant feature enhancement is to consider the addition or deletion of this
feature on iOS devices such as the iPhone, iPad and iPod touch. This is correct but only in a
narrow sense. The hypothetical situation relevant to damages in that case is only the addition of
the feature to relevant Apple products. However, the economic value of that enhancement
depends on the other competing products in the marketplace. Thus, a conjoint survey which only
uses Apple products in developing conjoint profiles cannot be used for economic valuation.
The value of a feature in the marketplace is determined by the set of alternative products. For
example, in a highly competitive product category with many highly substitutable products, the
economic value or incremental profits that could accrue to any one competitor would typically be
very small. However, in an isolated part of the product space (that is a part of the attribute space
that is not densely filled in with competing products), a firm may capture more of the value to
consumers of a feature enhancement. For example, if a certain feature is added to an Android
device, this may cause greater harm to Samsung in terms of lost sales/profits because smart
devices in the Android market segment (of which Samsung is a part) are more inter-substitutable.
It is possible that addition of the same feature to the iOS segment may be more valuable, as Apple
iOS products may be viewed as less substitutable with Android products than Android products are
with one another. We emphasize that these examples are simply conjectures to illustrate the point that a
full set of competing products must be used in the conjoint study.
We do not think it necessary to have all possible product variants or competitors in the
conjoint study and subsequent equilibrium computations. In many product categories, this would
require a massive set of possible products with many features. Our view is that it is important to
design the study to consider the major competing products both in terms of brands and the
attributes used in the conjoint design. It is not required that the conjoint study exactly mirror the
complete set of products and brands in the marketplace, but the main exemplars of
competing brands and product positions must be included.
4.2 Outside Option
There is considerable debate as to the merits of including an outside option in conjoint
studies. Many practitioners use a “forced-choice” conjoint design in which respondents are
forced to choose one from the set of product profiles in each conjoint choice task. The view is that
“forced-choice” will elicit more information from the respondents about the tradeoffs between
product attributes. If the “outside” or “none of the above” option is included, advocates of forced
choice argue that respondents may shy away from the cognitively more demanding task of
assessing tradeoffs and select the “none” option to reduce cognitive effort. On the opposite side,
other practitioners advocate inclusion of the outside option in order to assess whether or not the
product profiles used in the conjoint study are realistic in the sense of attracting considerable
demand. The idea is that if respondents select the “none of the above” option too frequently,
then the conjoint design has offered very unattractive hypothetical products. Still others (see, for
example, Brazell et al., 2006) argue the opposite side of the argument for forced choice. They
argue that there is a “demand” effect in which respondents select at least one product to “please”
the investigator. There is also a large literature on how to implement the “outside” option.
Whether or not the outside option is included depends on the ultimate use of the conjoint
study. Clearly, it is possible to measure how respondents trade-off different product attributes
against each other without inclusion of the outside option. For example, it is possible to estimate
the price coefficient in a conjoint study which does not include the outside option. Under the
assumption that all respondents are NOT budget constrained, the price coefficient should
theoretically measure the trade-offs between other attributes and price. The fact that respondents
might select a lower price and pass on some features means that they have an implicit valuation
of the dollar savings involved in this trade-off. If all respondents are standard economic agents in
the sense that they engage in constrained utility maximization, then this valuation of dollar
savings is a valid estimate of the marginal utility of income. This means that a conjoint study
without the outside option can be used to compute the p-WTP measure, which only requires a
valid price coefficient.
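In generic notation, this is the familiar ratio of part-worth differences to the magnitude of the price coefficient (the calculation described later as dividing the part-worths by the price coefficient):

\[
p\text{-WTP}(a \rightarrow a^{*}) \;=\; \frac{\beta_{a^{*}} - \beta_{a}}{|\beta_{price}|}
\]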
We have argued that p-WTP is not a measure of the economic value to the firm of feature
enhancement. This requires a complete demand system (including the outside good) as well as
the competitive and cost conditions. In order to compute valid equilibrium prices, we need to
explicitly consider substitution from and to other goods including the outside good. For example,
suppose we enhance a product with a very valuable new feature. We would expect to capture
sales from other products in the category as well as to expand the category sales; the introduction
of the Apple iPad dramatically grew the tablet category due, in part, to the features incorporated
in the iPad. Chintagunta and Nair (2011) make a related observation that price elasticities will be
biased if the outside option is not included. We conclude that an outside option is essential for
economic valuation of feature enhancement as the only way to incorporate substitution in and out
of the category is by the addition of the outside option.
At this point, it is possible to take the view that if respondents are pure economic actors, they
should select the outside option according to their true preferences and their
choices will properly reflect the marginal utility of income. However, there is a growing
literature which suggests that different ways of expressing or allowing for the outside option will
change the frequency with which it is selected. In particular, the so-called “dual response” way
of allowing for the outside option (see Uldry et al., 2002 and Brazell et al., 2006) has been found
to increase the frequency of selection of the outside option. The “dual-response” method first asks
the respondent to indicate which of the product profiles (without the outside option) is most
preferred and then asks whether the respondent would actually buy that product at the price posted in
the conjoint design. Our own experience confirms that this mode of including the outside option
greatly increases the selection of the outside option. Our experience has also been that the
traditional method of including the outside option often elicits a very low rate of selection which
we view as unrealistic. The advocates of the “dual response” method argue that the method helps
to reduce a conjoint survey bias toward higher purchase rates than in the actual marketplace.
Another way of reducing bias toward higher purchase rates is to design a conjoint using an
“incentive-compatible” scheme in which the conjoint responses have real monetary
consequences. There are a number of ways to do this (see, for example, Ding et al., 2005) but
most suggestions (an interesting exception is Dong et al., 2010) use some sort of actual product
and a monetary allotment. If the products in the study are actual products in the marketplace,
then the respondent might actually receive the product chosen (or, perhaps, be eligible for a
lottery which would award the product with some probability). If the respondent selects the
outside option, they would receive a cash transfer (or equivalent lottery eligibility).
4.3 Estimating Price Sensitivity
Both WTP and equilibrium prices are sensitive to inferences regarding the price coefficient.
If the distribution of price coefficients puts any mass at all on positive values, then there does not exist a
finite equilibrium price. All firms will raise prices without bound, effectively firing all consumers with
negative price sensitivity and making infinite profits on the segment with positive price sensitivity.
Most investigators regard positive price coefficients as inconsistent with rational behavior.
However, it will be very difficult for a normal model to drive the mass over the positive half line
for price sensitivity to a negligible quantity if there is mass near zero on the negative side.
We must distinguish uncertainty in posterior inference from irrational behavior. If a number
of respondents have posteriors for price coefficients that put most mass on positive values, this
suggests a design error in the conjoint study; perhaps respondents are using price as a proxy for
the quality of omitted features and ignoring the “all other things equal” survey instructions. In
this case, the conjoint data should be discarded and the study re-designed. On the other hand, we
may find considerable mass on positive values simply because of the normal assumption and the fact
that we have very little information about each respondent. In these situations, we have found it
helpful to change the prior or random effect distribution to impose a sign constraint on the price
coefficient.
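One common way to impose such a constraint (a sketch, not necessarily the authors' implementation) is to reparameterize the price coefficient so the heterogeneity distribution lives entirely on the negative half line, for example by placing a normal random-effect distribution on the log of its magnitude:

import numpy as np

# Minimal sketch: reparameterize the price coefficient as beta_price = -exp(theta), with a
# normal random-effect distribution on theta. This forces every respondent's price
# coefficient to be strictly negative while preserving heterogeneity in its magnitude.
rng = np.random.default_rng(1)
mu_theta, sigma_theta = -1.0, 0.5                          # hypothetical upper-level parameters
theta_i = rng.normal(mu_theta, sigma_theta, size=1000)     # respondent-level draws
beta_price_i = -np.exp(theta_i)                            # strictly negative price coefficients
print(beta_price_i.max())                                  # always < 0, so finite equilibrium prices exist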
In many conjoint studies, the goal is to simulate market shares for some set of products.
Market shares can be relatively insensitive to the distribution of the price coefficients when
prices are fixed to values typically encountered in the marketplace. It is only when one considers
unusual relative prices, or prices that are relatively high or low, that the implications of the
distribution of price sensitivity will be felt. By definition, price optimization will stress-test the
conjoint exercise by considering prices outside the small range usually considered in market
simulators. For this reason, the quality standards for design and analysis of conjoint data have to
be much higher when used for economic valuation than for many of the typical uses of
conjoint. Unless the distribution of price sensitivity puts little mass near zero, the conjoint data
will not be useful for economic valuation using either our equilibrium approach or for the use of
the more traditional and flawed p-WTP methods.
5. ILLUSTRATION
To illustrate our proposed method for economic valuation and to contrast our method with
standard p-WTP methods, we consider the example of the digital camera market. We designed a
conjoint survey to estimate the demand for features in the point and shoot submarket. We
considered the following seven features with associated levels:
1. Brand: Canon, Sony, Nikon, Panasonic
2. Pixels: 10, 16 mega-pixels
3. Zoom: 4x, 10x optical
4. Video: HD (720p), Full HD (1080p) and mic
5. Swivel Screen: No, Yes
6. WiFi: No, Yes
7. Price: $79–279
We focused on evaluating the economic value of the swivel screen feature which is illustrated
in Figure 2. The conjoint design was a standard fractional factorial design in which each
respondent viewed sixteen choice sets, each of which featured four hypothetical products. A dual
response mode was used to incorporate the outside option. Respondents were first asked which
of the four profiles presented in each choice task was most preferred. Then the respondent was
asked if they would buy the preferred profile at the stated price. If no, then this response is
recorded as the “outside option” or “none of the above.” Respondents were screened to only
those who owned a point and shoot digital camera and who considered themselves to be a major
contributor to the decision to purchase this camera.
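A sketch of how such dual-response answers can be coded for estimation (the field names below are hypothetical): each task becomes a choice among five alternatives, the four profiles plus the outside option, with the outside option recorded whenever the follow-up buy question is answered “no.”

# Minimal sketch of coding one dual-response task into a single discrete choice.
# Alternative indices 0-3 are the four profiles shown; index 4 is "none of the above".

def code_dual_response(best_profile: int, would_buy: bool) -> int:
    """Return the chosen alternative for estimation purposes.

    best_profile: 0-3, the profile picked as most preferred in the forced choice.
    would_buy: answer to the follow-up "Would you buy it at the stated price?"
    """
    return best_profile if would_buy else 4   # 4 = outside option

# Example: the respondent prefers profile 2 but declines to buy it at the stated price.
assert code_dual_response(2, False) == 4
assert code_dual_response(2, True) == 2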
Figure 2. Swivel Screen Attribute
Details of the study, its sampling frame, number of respondents and details of estimation are
provided in Allenby et al., 2013. We focus here on some of the important summary findings:
1. The p-WTP measure of the swivel screen attribute is $63.
2. The WTP measure of the swivel screen attribute is $13.
3. The equilibrium change in profits is estimated to be $25.
We find that the p-WTP measure dramatically overstates the economic value of a product
feature, and that the more economic-based measures are more reasonable.
6. CONCLUSION
Valuation of product features is an important part of the development and marketing of new
products as well as the valuation of patents that are related to feature enhancement. We take the
position that the most sensible measure of the economic value of a feature enhancement (either
the addition of a completely new feature or the enhancement of an existing feature) is
incremental profits. That is, we compare the equilibrium outcomes in a marketplace in which one
of the products (corresponding to the focal firm) is feature enhanced with the equilibrium profits
in the same marketplace but where the focal firm’s product is not feature enhanced. This measure
of economic value can be used to make decisions about the development of new features or to
choose between a set of features that could be enhanced. In the patent litigation setting, the value
of the patent as well as the damages that may have occurred due to patent infringement should be
based on an incremental profits concept.
Conjoint studies can play a vital role in feature valuation provided that they are properly
designed, analyzed, and supplemented by information on the competitive and cost structure of
the marketplace in which the feature-enhanced product is introduced. Conjoint methods can be
used to develop a demand system but require careful attention to the inclusion of the outside
option and inclusion of the relevant competing brands. Proper negativity constraints must be
used to restrict the price coefficients to negative values. In addition, the Nash equilibrium prices
computed on the basis of the conjoint-constructed demand system are sensitive to the precision
of inference with respect to price sensitivity. This may mean larger and more informative
samples than typically used in conjoint applications today.
We explain why the current practice of using a change in “WTP” as a way of valuing a feature
is not a valid measure of economic value. In particular, the calculations done today involving
dividing the part-worths by the price coefficient are not even proper measures of WTP. Current
pseudo-WTP measures have a tendency to overstate the economic value of feature enhancement
as they are only measures of shifts in demand and do not take into account the competitive
response to the feature enhancement. In general, firms competing against the focal feature-enhanced product will adjust their prices downward in response to the more formidable
competition afforded by the feature enhanced product. In addition, WTB analyses will also
overstate the effects of feature enhancement on market share or sales as these analyses also do
not take into account the fact that a new equilibrium will prevail in the market after feature
enhancement takes place.
We illustrate our method by an application in the point and shoot digital camera market. We
consider the addition of a swivel screen display to a point and shoot digital camera product. We
designed and fielded a conjoint survey with all of the major brands and other major product
features. Our equilibrium computations show that the economic value of the swivel screen is
substantial and discernible from zero, but about one half of the pseudo-WTP measure commonly
employed.
Greg Allenby
Peter Rossi
REFERENCES
Allenby, G. M., J. D. Brazell, J. R. Howell and P. E. Rossi (2014) “Economic Valuation of Product
Features,” working paper, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2359003
Berry, S., J. Levinsohn, and A. Pakes (1995): “Automobile Prices in Market Equilibrium,”
Econometrica, 63(4), 841–890.
Brazell, J., C. Diener, E. Karniouchina, W. Moore, V. Severin, and P.-F. Uldry (2006): “The No-Choice Option and Dual Response Choice Designs,” Marketing Letters, 17(4), 255–268.
Chintagunta, P. K., and H. Nair (2011): “Discrete-Choice Models of Consumer Demand in
Marketing,” Marketing Science, 30(6), 977–996.
Ding, M., R. Grewal, and J. Liechty (2005): “Incentive-Aligned Conjoint,” Journal of Marketing
Research, 42(1), 67–82.
Dong, S., M. Ding, and J. Huber (2010): “A Simple Mechanism to Incentive-Align Conjoint
Experiments,” International Journal of Research in Marketing, 27, 25–32.
McFadden, D. L. (1981): “Econometric Models of Probabilistic Choice,” in Structural Analysis
of Discrete Choice, ed. by M. Intriligator, and Z. Griliches, pp. 1395–1457. North-Holland.
Ofek, E., and V. Srinivasan (2002): “How Much Does the Market Value an Improvement in a
Product Attribute,” Marketing Science, 21(4), 398–411.
Orme, B. K. (2001): “Assessing the Monetary Value of Attribute Levels with Conjoint Analysis,”
Discussion paper, Sawtooth Software, Inc.
Petrin, A. (2002): “Quantifying the Benefits of New Products: The Case of the Minivan,” Journal
of Political Economy, 110(4), 705–729.
Rossi, P. E., G. M. Allenby, and R. E. McCulloch (2005): Bayesian Statistics and Marketing.
John Wiley & Sons.
Sonnier, G., A. Ainslie, and T. Otter (2007): “Heterogeneity Distributions of Willingness-to-Pay
in Choice Models,” Quantitative Marketing and Economics, 5, 313–331.
Trajtenberg, M. (1989): “The Welfare Analysis of Product Innovations, with an Application to
Computed Tomography Scanners,” Journal of Political Economy, 97(2), 444–479.
Uldry, P., V. Severin, and C. Diener (2002): “Using a Dual Response Framework in Choice
Modeling,” in AMA Advanced Research Techniques Forum.
THE BALLAD OF BEST AND WORST
TATIANA DYACHENKO
REBECCA WALKER NAYLOR
GREG ALLENBY
OHIO STATE UNIVERSITY
“Best is best and worst is worst,
and never the twain shall meet
Till both are brought to one accord
in a model that’s hard to beat.”
(with apologies to Rudyard Kipling)
In this paper, we investigate the psychological processes underlying the Best-Worst choice
procedure. We find evidence for sequential evaluation in Best-Worst tasks that is accompanied
by sequence scaling and question framing scaling effects. We propose a model that accounts for
these effects and show the superiority of our model over currently used models of single evaluation.
INTRODUCTION
Researchers in marketing are constantly developing tools and methodologies to improve the
quality of inferences about consumer preferences. Examples include the use of hierarchical
models of marketplace data and the development of novel ways of data collection in surveys. An
important aspect of this research involves validating and testing models that have been proposed.
In this paper, we take a look at one of these relatively new methods called “Maximum
Difference Scaling,” also known as Best-Worst choice tasks. This method was proposed (Finn &
Louviere, 1992) to address the concern that insufficient information is collected for each
individual respondent in discrete choices experiments. In Best-Worst choice tasks, each
respondent is asked to make two selections: select the best, or most preferred, alternative and the
worst, least preferred, alternative from a list of items. Thus, the tool allows the researcher to
collect twice as many responses on the same number of choices tasks from the same respondent.
This tool has been extensively studied and compared to other tools available to marketing
researchers (Bacon et al., 2007; Wirth, 2010; Marley et al., 2005; Marley et al., 2008). MaxDiff
became popular in practice due its superior performance (Wirth, 2010) compared to traditional
choice-based tasks in which only one response, the best alternative, is collected.
While we applaud the development of a tool that addresses the need for better inferences, we
believe that marketing researchers need to deeply think about analysis of the data coming from
the tool. It is important to understand assumptions that are built into the models that estimate
parameters related to consumer preferences.
The main assumption that underlies current analysis of MaxDiff data is the assumption of
equivalency of the “select-the-best” and “select-the-worst” responses, meaning that two pieces of
information from two response subsets contain the same quality of information that can be
extracted to make inferences about consumer preferences. We can test this assumption of
equivalency of information by performing the following analysis. We can split the Best-Worst
data into two sub-datasets—“best” only responses and “worst” only responses. If respondents
indeed rank items from top to bottom in a single evaluation, then we can run the same model to
obtain inferences about preference parameters in each data subset. Under this one-time ranking
assumption, the parameters recovered from the two subsets should be nearly identical.
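A sketch of this diagnostic (with a generic pooled multinomial logit standing in for whatever model is actually applied to each subset, and data assumed to be already split): estimate the same model on the “best” picks and, with utilities negated, on the “worst” picks, then compare the two sets of estimates.

import numpy as np

def fit_mnl(X_sets, choices, negate=False, lr=0.1, iters=2000):
    """Pooled multinomial logit fit by gradient ascent.

    X_sets: (T, J, K) item attributes for T tasks of J items each.
    choices: (T,) index of the selected item in each task.
    negate=True flips the sign of utility, so "select-the-worst" picks can be fit
    with the same code (the worst item has the lowest utility).
    """
    T, J, K = X_sets.shape
    sign = -1.0 if negate else 1.0
    beta = np.zeros(K)
    for _ in range(iters):
        V = sign * X_sets @ beta                                   # (T, J) utilities
        P = np.exp(V - V.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)                          # choice probabilities
        chosen_X = X_sets[np.arange(T), choices]                   # (T, K)
        grad = (sign * (chosen_X - (P[:, :, None] * X_sets).sum(axis=1))).mean(axis=0)
        beta += lr * grad
    return beta

# Hypothetical usage once the Best-Worst data are split into two subsets:
# beta_best  = fit_mnl(X_sets, best_choices)
# beta_worst = fit_mnl(X_sets, worst_choices, negate=True)
# Plotting beta_best against beta_worst should give points near the 45-degree line
# if the one-time ranking assumption holds.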
Figure 1 shows the findings from performing this analysis on an actual dataset generated by
the Best-Worst task. We plotted the means of the estimated preference parameters β from the “select-the-best” only responses on the horizontal axis and from the “select-the-worst” only responses on the
vertical axis. If the two subsets from the Best-Worst data contained the same quality of information about
the preference parameters β, then all points would lie on or close to the 45-degree line.
We see that there are two types of systematic non-equivalence of the two datasets. First, the
datasets differ in the size of the range that the parameters span for Best and Worst responses. This
result is interesting, as it would indicate more consistency in Best responses than in Worst.
Second, there seems to be a possible relationship between the Best and Worst parameters. These
two factors indicate that the current model’s assumption of single, or one-time, evaluation in
Best-Worst tasks should be re-evaluated as the actual data do not support this assumption.
Figure 1. Means of preference parameters estimated from the “select-the-best” subset
(horizontal axis) and from the “select-the-worst” subset (vertical axis)
PROPOSED APPROACH
To understand the results presented in Figure 1, we take a deeper look at how the decisions in
the Best-Worst choice tasks are made, that is, we consider possible data-generating mechanisms
in these tasks. To think about these processes, we turned to the psychology literature, which presents
a vast amount of evidence suggesting that we should not expect the information
collected from the “select-the-best” and “select-the-worst” decisions to be equivalent. This
literature also provides a source of multiple theories that can help drive how we think about and
analyze the data from Best-Worst choice tasks.
In this paper, we present an approach that takes advantage of theories in the psychology
literature. We allow these psychological theories to drive the development of the model
specification for the Best-Worst choice tasks. We incorporate several elements into the
mathematical expression of the model, the presence of which is driven by specific theories.
The first component is related to sequential answering of the questions in the Best-Worst
choice tasks. Sequential evaluation is one of the simplifying mechanisms that we believe is used
by respondents in these tasks. This mechanism allows for two possible sequences of decision:
selecting the best first and then moving to selecting the worst alternative, or answering the
“worst” question first and then choosing the “best” alternative. This is in contrast to the
assumptions of the two current models developed for these tasks: single ranking as we described
above and a pairwise comparison of all items presented in the choice task, where people are
assumed to maximize the distance between the two items.
The sequential decision making that we assume in our model generates two different
conditions under which respondents provide their answers because there are different numbers of
items that people evaluate in the first and the second decision. In the first response, there is a full
list of items from which to make a choice, while the second choice involves a subset of items
with one fewer alternative because the first selected item is excluded from the subsequent
decision making. This possibly changes the difficulty of the tasks as respondents move from the
first to the second choice, making the second decision easier with respect to the number of items
that need to be processed. This change in the task in the second decision is represented in our
model through parameter ψ (order effect), which is expected to be greater than one to reflect that
the second decision is less error prone because it is easier.
Another component of our model deals with the nature of the “select-the-best” and “select-the-worst” questions. We believe that there is another effect that can be accounted for in our
sequential evaluation model for these choice tasks that cannot be included in any of the single-evaluation models. As discussed above, “sequential evaluation” means that there are two
questions in Best-Worst tasks, “select-the-best” and “select-the-worst,” that are answered
sequentially. But these two questions require switching between the mindsets that drive the
responses to two questions. To select the best alternative, a person retrieves experiences and
memories that are congruent with the question at hand—“select-the-best.” The other question,
“select-the-worst” is framed such that another, possibly overlapping, set of memories and
associations are retrieved that is more congruent with that question.
The process of such biased memory retrieval is described in the psychology literature
exploring hypothesis testing theory and the confirmation bias (Snyder, 1981; Hoch and Ha,
1986). This literature suggests that people are more likely to attend to information that is
consistent with a hypothesis at hand. In the Best-Worst choice tasks, the temporary hypothesis in
the “Best” question is “find the best alternative,” so that people are likely to attend mostly to
memories related to the best or most important things that happened to them. The “Worst”
question would generate another hypothesis “select the worst” that people would be trying to
confirm. This would create a different mental frame making people think about other, possibly
bad or less important, experiences to answer that question.
The subsets of memories from the two questions might be different or overlap partially. We
believe that there is an overlap and, thus, the differences in preference parameters between the
two questions can be represented by the change in scale. This is a scale parameter λ (question
framing effect) in our model. However, if the retrievals in the two questions are independent and
generate different samples of memories, then a model where we allow for independent
preference β parameters would perform better than the model that only adjusts the scale
parameter.
The third component of the model is related to the error term distribution in models for Best-Worst choice tasks. Traditionally, a logit specification that is based on the maximum extreme
value assumption of the error term is used. This is mostly due to the mathematical and
computational convenience of these models: the probability expressions have closed forms and,
hence, the model can be estimated relatively fast.
We, however, want to give the error term distributional assumption serious consideration by
thinking about more appropriate reasons for the use of extreme value (asymmetric) versus
normal (symmetric) distributional assumptions. The question is: can we use the psychology
literature to help us justify the use of one specification versus another? As an example of how it
can be done, we use the theory of episodic versus semantic memory retrieval and processing
(Tulving, 1972).
When answering Best-Worst questions, people need to summarize the subsets of information
that were just retrieved from memory. If the memories and associations are aggregated by
averaging (or summing) over the episodes and experiences (which would be consistent with a
semantic information processing and retrieval mechanism), then that would be consistent with
the use of the normally distributed (symmetric) error term due to the Central Limit Theorem.
However, if respondents pay attention to specific episodes within these samples of information
looking for the most representative episodes to answer the question at hand (which would be
consistent with an episodic memory processing mechanism), then the extreme value error term
assumption would be justified. This is due to Extreme Value Theory, which says that the
maximum (or minimum) of a large number of random draws is asymptotically distributed as a Max (or Min) extreme value random variable. Thus,
in the “select-the-best” decision it is appropriate to use the maximum extreme value error term,
and in the “select-the-worst” question, the minimum extreme value distribution is justified.
Equation 1 is the model for one Best-Worst decision task. This equation shows the model
based on episodic memory processing, or extreme value error terms. It includes the two possible
sequences, indexed by the parameter θ, the order scale parameter ψ in the second decision,
exclusion of the first choice from the set in the second decision, and our question framing scaling
parameter λ. The model with the normal error term assumption has the same conceptual structure
but the choice probabilities have different expressions.
Equation 1. Sequential Evaluation Model (logit specification)
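The equation itself appears as an image in the original proceedings. A sketch of its structure, reconstructed from the verbal description above and therefore only an approximation of the authors' exact specification, is the following. With choice set S, item utilities β′x_j, order scale ψ on the second decision, question framing scale λ on the “worst” utilities, and θ ∈ {0, 1} indexing the latent sequence:

\[
P(\text{best}=b,\ \text{worst}=w) =
\begin{cases}
\dfrac{e^{\beta' x_b}}{\sum_{j \in S} e^{\beta' x_j}} \cdot
\dfrac{e^{-\psi \lambda \beta' x_w}}{\sum_{j \in S \setminus \{b\}} e^{-\psi \lambda \beta' x_j}}, & \theta = 1 \ (\text{best first}),\\[2ex]
\dfrac{e^{-\lambda \beta' x_w}}{\sum_{j \in S} e^{-\lambda \beta' x_j}} \cdot
\dfrac{e^{\psi \beta' x_b}}{\sum_{j \in S \setminus \{w\}} e^{\psi \beta' x_j}}, & \theta = 0 \ (\text{worst first}).
\end{cases}
\]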
This is a generalized model that includes some existing models as special cases. For
example, if we use a probability weight instead of the sequence indicator θ, then under specific
values of that parameter our model would include the traditional MaxDiff model. The
concordant model by Marley et al. (2005) would also be a special case of our modified model.
EMPIRICAL APPLICATION AND RESULTS
We applied our model to data collected from an SSI panel. Respondents went
through 15 choice tasks with five items each, as shown in Figure 2. The items came from a list
of 15 hair care concerns and issues. We analyzed responses from 594 female respondents over 50
years old. This sample of the population is known for high involvement with the hair care
category. For example, in our sample, 65% of respondents expressed some level of involvement
with the category.
Figure 2. Best-Worst task
We estimated our proposed models with and without the proposed effects. We used
Hierarchical Bayesian estimation where preference parameters β, order effect ψ and context
effect λ are heterogeneous. To ensure empirical identification, the latent sequence parameter θ is
estimated as an indicator parameter from a Bernoulli distribution and is assumed to be the same for
all respondents. We use standard priors for the parameters of interest.
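The text does not spell out the “standard priors”; a sketch of the hierarchical structure implied by the description, with illustrative (assumed) distributional forms for the positive scale parameters, is:

\[
\beta_i \sim N(\bar{\beta}, V_\beta), \qquad
\ln \psi_i \sim N(\mu_\psi, \sigma_\psi^2), \qquad
\ln \lambda_i \sim N(\mu_\lambda, \sigma_\lambda^2), \qquad
\theta \sim \text{Bernoulli}(\pi),
\]

with conventional diffuse hyperpriors on the upper-level parameters and a single θ common to all respondents.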
Table 1 shows the improvement of model fit (log marginal density, Newton-Raftery
estimator) as the result of the presence of each effect, that is, the marginal effect of each model
element. Table 2 shows in-sample and holdout hit probabilities for the Best-Worst pair (random
chance is 0.05).
Table 1. Model Fit

Model                                    LMD NR
Exploded logit (single evaluation)       -13,040
Context effect only                      -12,455
Order effect only                        -11,755
Context and order effects together       -11,051
These tables show significant improvement in fit from each of the components of the model.
The strongest improvement comes from the order effect, indicating that the sequential
mechanisms we assumed are more plausible given the data than the model with the assumption
of single evaluation. The context effect improves the fit as well, indicating that it is likely that the
two questions, “select-the-best” and “select-the-worst,” are processed differently by respondents.
The model with both effects included is the best model not just with respect to in-sample fit,
but also in terms of holdout performance.
Table 2. Improvement in Model Fit

Model                                    In-sample Hit    In-sample       Holdout Hit     Holdout
                                         Probabilities    Improvement*    Probabilities   Improvement*
Exploded logit (single evaluation)       0.3062           -               0.2173          -
Context effect only                      0.3168           3.5%            0.2226          2.4%
Order effect only                        0.3443           12.4%           0.2356          8.4%
Context and order effects together       0.3789           23.7%           0.2499          15.0%

* Improvements are calculated over the metric in the first line, which comes from the model that
assumes single evaluation (ranking) in Best-Worst tasks.
We found that both error term assumptions (symmetric and asymmetric) are plausible, as the
fit of the two models is very similar. Based on that finding, we can recommend using our sequential
logit model, as it has computational advantages over the sequential probit model. The remaining
results we present are based on the sequential logit model.
We also found that the presence of dependent preference parameters between the “Best” and
“Worst” questions (question framing scale effect λ) is a better fitting assumption than the
assumption of independence of β’s from the two questions.
From a managerial standpoint, we want to show why it is important to use our sequential
evaluation model instead of single evaluation models. We compared individual preference
parameters from two models: our best performing model and exploded logit specification
(single-evaluation ranking model). Table 3 shows the proportion of respondents for whom a
subset of top items is the same between these two models. For example, for the top 3 items
related to the hair care concerns and issues, the two models agree only for 61% of respondents. If
we take into account the order within these subsets, then the matching proportion drops to 46%.
This means that for more than half of respondents in our study, the findings and
recommendations will be different between the two models. Given that our model of
sequential evaluation is a better fitting model, we suggest that the results from single evaluation
models can be misleading for managerial implications and that the results from our sequential
evaluation model should be used.
Table 3. Proportion of respondents matched on top n items of importance between
sequential and single evaluation (exploded logit) models
Top n items    Proportion of respondents        Proportion of respondents
               (Order does not matter)          (Order does matter)
1              83.7%                            83.7%
2              72.1%                            65.0%
3              61.1%                            46.5%
4              53.2%                            29.1%
5              47.0%                            18.4%
6              37.7%                            10.4%
Our sequential evaluation model also provides additional insights about the processes that are
present in Best-Worst choice tasks. First, we found that in these tasks respondents are more likely
to eliminate the worst alternative from the list and then select the best one. This is consistent with
literature that suggests that people, when presented with multiple alternatives, are more likely to
simplify the task by eliminating, or screening out, some options (Ordóñez et al., 1999; Beach and
Potter, 1992). Given the nature of the task in our application, where respondents had to select the
most important and least important items, it is not surprising that eliminating what is not
important first would be the most likely strategy.
This finding, however, is in contrast with the click data that was collected in these tasks. We
found that about 68% of clicks were best-then-worst. To understand this discrepancy, we added
the observed sequence information into our model by substituting the indicator of latent
sequence θ with the decision order that we observed. Table 4 shows the results of the fit of these
models. The data on observed sequence makes the fit of the model worse. This suggests that
researchers need to be careful when thinking that click data is a good representation of the latent
processes driving consumer decisions in Best-Worst choice tasks.
Table 4. Fit of the models with latent and observed sequence of decisions.
Model                LMD NR      In-sample Hit    Holdout Hit
                                 Probabilities    Probabilities
Latent sequence      -11,051     0.3789           0.2499
Observed sequence    -12,392     0.3210           0.2247
To investigate order further, we manipulated the order of the decisions by collecting
responses from two groups. One group was forced to select the best alternative first and then
select the worst, and the second group was forced to select in the opposite order. We found that
the fit of our model with the indicator of latent sequence is the same as for the group that was
required to select the worst alternative first. This analysis gives us more confidence in our
finding. Understanding why the click data seem to be inconsistent with the underlying
decision-making processes in respondents’ minds is outside the scope of this paper, but it is an
important topic for future research.
Our model also gives us an opportunity to learn about other effects present in Best-Worst
choice tasks and account for those effects. For instance, there is a difference in the certainty level
between the first and the second decisions. As expected, the second decision is less error prone
than the first. The mean of the posterior distribution of the order effect ψ is greater than one for
almost all respondents. This finding is consistent with our expectation that the decrease in the
difficulty of the task in the second choice will impact the certainty level. While we haven’t
directly tested the impact of the number of items in the list on the certainty level, our finding is
in line with this expectation.
Another effect that we have included in our model is the scale effect of question framing λ,
which represents the level of certainty in the parameters that a researcher obtains between the
best and worst selections as the result of the response elicitation procedure—“best” versus
“worst.” We found that the sample average of this parameter is 1.17, which is greater than
one. This means that, on average, respondents in our sample are more consistent in their “worst”
choices. However, we found significant heterogeneity in this parameter among respondents.
To understand what can explain the heterogeneity in this parameter, we performed a post-estimation analysis of the context scale parameter as it relates to an individual’s expertise level,
which was also collected in the survey. We found a negative correlation of -0.16 between the
means of the context effect parameter and the level of expertise, meaning that experts are likely
to be more consistent in what is important to them and non-experts are more consistent about
what is not important to them.
We also found a significant negative correlation (-0.20) between the direct measure of the
difficulty of the “select-the-worst” items and the context effect parameter, indicating that when it was
easier to respond to the “select-the-worst” questions, λ was larger, which is consistent with our
proposition and expectations.
CONCLUSIONS
In this paper, we proposed a model to analyze data from Best-Worst choice tasks. We showed
how the development of model specification could be driven by theories from the psychology
literature. We took a deep look at how we can think about the possible processes that underlie
decisions in these tasks and how to reflect them in the mathematical representation of the data-generating mechanism.
We found that our proposed model of sequential evaluation is a better fitting model than the
currently used models of single evaluation. We showed that adding the sequential nature to the
model specification allows other effects to be taken into consideration. We found that the second
decision is more certain than the first decision, and that the “worst” decision is, on average, more
certain than the “best.”
Finally, we demonstrated the managerial implications of the proposed model. Our model that
takes into account psychological processes within Best-Worst choice tasks gives different results
about what is most important to specific respondents. This finding has direct implications for
new product development initiatives and understanding the underlying needs and concerns of
customers.
Greg Allenby
REFERENCES
Bacon, L., Lenk, P., Seryakova, K., Veccia, E. (2007) “Making MaxDiff More Informative:
Statistical Data Fusion by Way of Latent Variable Modeling,” Sawtooth Software Conference
Proceedings, 327–343.
Beach, L. R., & Potter, R. E. (1992) “The pre-choice screening of options,” Acta Psychologica,
81(2), 115–126.
Finn, A. & Louviere, J. (1992). “Determining the Appropriate Response to Evidence of Public
Concern: The Case of Food Safety,” Journal of Public Policy & Marketing, Vol. 11, No. 2
(Fall, 1992), 12–25.
Hoch, S. J. and Ha, Y.-W. (1986) “Consumer Learning: Advertising and the Ambiguity of
Product Experience,” Journal of Consumer Research, Vol. 13, 221–233.
Marley, A. A. J. & Louviere, J.J. (2005). “Some Probabilistic Models of Best, Worst, and Best-Worst Choices,” Journal of Mathematical Psychology, 49, 464–480.
Marley, A. A. J., Flynn, T.N. & Louviere, J.J. (2008). “Probabilistic Models of Set-Dependent and
Attribute-Level Best-Worst Choice,” Journal of Mathematical Psychology, 52, 281–296.
Ordóñez, L. D., Benson III, L. and Beach, L. R. (1999), “Testing the Compatibility Test: How
Instructions, Accountability, and Anticipated Regret Affect Prechoice Screening of Options,”
Organizational Behavior and Human Decision Processes, Vol. 78, 63–80.
Snyder, M. (1981) “Seek and ye shall find: Testing hypotheses about other people,” in C.
Heiman, E. Higgins and M. Zanna, eds, ‘Social Cognition: The Ontario Symposium on
Personality and Social Psychology,’ Hillsdale, NJ: Erlbaum, 277–303.
Tulving, E. (1972) “Episodic and Semantic Memory,” in E. Tulving and W. Donaldson, eds,
‘Organization of Memory,’ Academic Press, New York and London, pp. 381–402.
Wirth, R. (2010) “HB-CBC, HB-Best-Worst_CBC or NO HB at All,” Sawtooth Software
Conference Proceedings, 321–356.