Learn More About MaxDiff

This is a guide for everything you need to know about MaxDiff. It covers the “what is?” and the “how to…” of different approaches to the analysis, from preference share to profiling latent classes, and finally how to interpret the analysis. There are worked examples, shown in Displayr and R.

 

Introduction

What is MaxDiff? Have you ever needed to know what your customers prefer? MaxDiff (otherwise known as Best-Worst Scaling) involves respondents indicating the ‘Best’ and the ‘Worst’ options out of a given set, so that we can understand consumers’ preferences. These preferences can relate to how people feel about specific brands or products, or to which features make a difference and deliver value to the consumer.

A Beginners Guide to MaxDiff MaxDiff is a research technique for measuring relative preferences.

How MaxDiff Analysis Works This post explains the basic mechanics of how preferences can be measured using the data collected in a MaxDiff experiment.

 

DIY MaxDiff

Make MaxDiff a piece of cake Quickly go from experimental design to analysis, to interactive report in one tool. Displayr automates all the painful manual steps and makes it easy to deliver best-in-class results, even for the novice.

The 4 easy steps that’ll make any good researcher proficient at MaxDiff This webinar is for market researchers and consumer insights people who analyze data (from novice to expert).

11 Tips for your own MaxDiff Analysis  This post shares 11 tips to help researchers be self-sufficient in MaxDiff analysis.

DIY MaxDiff eBook This ebook will take you from generating experimental designs to conducting advanced Hierarchical Bayes analysis.

 

Interactive Tutorials

Creating an Experimental Design for MaxDiff This interactive tutorial demonstrates how to create a MaxDiff experimental design.

Analyzing MaxDiff Data This interactive tutorial demonstrates the range of purpose-built tools for analyzing the data from MaxDiff experiments available in Displayr.

 

Experimental Design

How to Create a MaxDiff Experimental Design in Displayr This post describes how you can create and check the design yourself.

Advanced MaxDiff Experimental Designs A MaxDiff experimental design creates multiple lists of alternatives to be shown to respondents in multiple questions.

Pairwise Balanced MaxDiff Designs This post gives some background on how MaxDiff designs are created, including a new method for making multiple version designs that are pairwise balanced.

How To Check Experimental Design This post explains the basic process followed when doing a rough-and-ready check of the experimental design.

 

Data File Formats

Custom Data Files - Survey Gizmo MaxDiff This QScript makes it possible to analyze Alchemer (formerly Survey Gizmo) MaxDiff data in Displayr.

MaxDiff Data File Layouts There is no standard way of laying out the data from MaxDiff experiments. The following descriptions encapsulate some of the common ways.

 

Statistical Analysis

How to Analyze MaxDiff Data in Displayr This post discusses a number of options that are available in Displayr for analyzing data from MaxDiff experiments.

Setting Up a MaxDiff Experiment as a Ranking There are some more 'exotic' types of analysis of MaxDiff data where it is useful to have the data set up as a Ranking Structure.

Counting Analysis of MaxDiff Data This post explains how to do Counts analysis of MaxDiff data.

Comparing MaxDiff Models and Creating Ensembles in Displayr There are a variety of different models available in Displayr to perform MaxDiff analysis. This post describes how to easily compare the models. It also demonstrates how to create an ensemble that combines the models and potentially improves prediction accuracy.

The Accuracy of Hierarchical Bayes When the Data Contains Segments This post explores the implications of using Hierarchical Bayes with data that contains segments.

Using Hierarchical Bayes for MaxDiff in Displayr This post describes how to run Hierarchical Bayes for MaxDiff in Displayr, and explain the options and outputs available.

Checking Convergence When Using Hierarchical Bayes for MaxDiff This post discusses technical information about how to check for convergence in a Hierarchical Bayes MaxDiff model.

Comparing Tricked Logit and Rank-Ordered Logit with Ties for MaxDiff This post compares two ways in which MaxDiff data is treated in analyses such as latent class analysis and Hierarchical Bayes.

Using Cross-Validation to Measure MaxDiff Performance This post compares various approaches to analyzing MaxDiff data using a method known as cross-validation.

Comparing MaxDiff Results from Different Packages This post lists the main reasons why you may get different results with different packages.

MaxDiff Mixture Models This post discusses the main mixture models used to analyze MaxDiff experiments.

Anchored MaxDiff  Anchored MaxDiff experiments supplement standard MaxDiff questions with additional questions designed to work out the absolute importance of the attributes.

 

Case Studies

MaxDiff Analysis in Displayr, a Case Study This case study illustrates an advanced analysis of experimental data in Displayr.

Case Study: MaxDiff - Presidential Traits This case study formed the 4th and final part of the webinar DIY Market Research Dashboards - Building 4 in 40 minutes.

Commander-in-Chief MaxDiff A story-style dashboard (an alternative to PowerPoint) showing an analysis of what Americans desire in their Commander-in-Chief.

 

Learn more

If you can't find something you can always ask the technical support team, who love to help. Just email support@displayr.com. Happy learning!

Comparing MaxDiff Results from Different Packages

Different models

There are lots of different statistical models that you can use to compute MaxDiff. Some of these get different results from Sawtooth simply because they are wrong. If you are doing counting analysis, aggregate multinomial logit, or aggregate rank-ordered logit models, then you will definitely get a different answer from Sawtooth. In the case of counting analysis, you will get a different answer simply because the technique is wrong. In the case of the other two models, you will get a different answer because these two models assume that everybody is the same, whereas the Sawtooth HB/HBA model assumes that all people are different. The Sawtooth assumption is the better one.

If you are using a latent class analysis model, such as the ones in Q, Displayr, and LatentGold, you will get different answers because these models assume that there are a small number of different segments, whereas Sawtooth HB assumes people lie on a continuum, and this difference can be important. As I discuss in "The Accuracy of Hierarchical Bayes When the Data Contains Segments", the HB model tends to be the safest one, but the smart thing to do when time permits is to run multiple models and compare them.

In the case of MaxDiff models, there is also a difference in terms of how the worst (least preferred) choices are modeled. There are two different approaches, Tricked Logit and Rank-Ordered Logit, and each gives slightly different results.

Different respondent-level scores

Now, assuming you are comparing equivalent models (e.g., Displayr's HB with Sawtooth's HB), the next thing to check is that the scores have been computed in the same way for each of the respondents. These numbers can be scaled in many different ways, and any comparison is only meaningful if they have been scaled in the same way.

The main scalings are:

  • Mean-centered utilities/coefficients/parameters. These will tend to be numbers between -10 and 10, with lots of decimal places. They will average 0. Coefficient and parameter are, in this context, synonyms. Utility is a more vaguely defined term, and can be the same thing as a coefficient/parameter, but may mean one of the other things in this list.
  • 0-Based utilities/coefficients/parameters. These will have one alternative set to 0 for all respondents, with the other utilities relative to this one.
  • Respondent-level Z-scores. These are mean centered utilities/coefficients/parameters that have been further standardized to have a standard deviation of 1 for each respondent.
  • 0 to 1 scaled utilities/coefficients/parameters. These are utilities/coefficients/parameters scaled to have a minimum value of 0 and a maximum value of 1.
  • 0 to 100 Scaled utilities/coefficients/parameters. These are utilities/coefficients/parameters scaled to have a minimum value of 0 and a maximum value of 100.
  • Preference shares/Probability %. These are scores that have a minimum of 0 and sum up to either 1 or 100, and are computed by a logit transformation of the coefficients/parameters.
  • K-alternative preference shares/Probability %. These are scores that have a minimum of 0 and maximum of either 1 or 100, and are computed using a variant of the logit transformation developed by Sawtooth (see this wiki page for more information).

Each of the different scalings means different things and all are sensible in some contexts. If your differences are due to different scalings, the trick is to work out which scaling is appropriate for your end users.
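To make the scalings concrete, here is a minimal R sketch (the numbers are hypothetical and this is not any package's internal code) showing how one respondent's mean-centered coefficients translate into several of the scalings above.

# Hypothetical mean-centered coefficients for one respondent (they sum to 0).
coefs <- c(Apple = 1.8, Google = 1.2, Samsung = 0.9, Sony = -0.4,
           Nokia = -1.1, Microsoft = -2.4)

z.scores   <- coefs / sd(coefs)                                   # respondent-level z-scores
zero.based <- coefs - coefs["Microsoft"]                          # 0-based, relative to one alternative
zero.one   <- (coefs - min(coefs)) / (max(coefs) - min(coefs))    # 0 to 1 scaling
shares     <- 100 * exp(coefs) / sum(exp(coefs))                  # preference shares (%), via the logit transformation

round(rbind(z.scores, zero.based, zero.one, shares), 2)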

Number of classes

Latent class analysis in general, and HB in Q, Displayr, and bayesm, all permit the specification of multiple classes. If you compare results from models with different number of classes, you should expect differences. The trick is to choose the model with the best predictive validity.

Non-convergence (too few iterations)

All of the modern analysis techniques start by randomly guessing some initial values and then trying to refine them. Each attempt at refining is called an iteration. If the default number of iterations is too small, you should not rely on the results. The way that you work this out is by checking that the model has converged (for theory, see Checking Convergence When Using Hierarchical Bayes for MaxDiff).

Most modern software will give warnings if you have a problem due to convergence. However, Sawtooth does not provide any warning about this (although our experience is that their default settings are OK, so there is probably not a problem here). It is possible to compute the standard convergence diagnostics for Sawtooth by using the monitor function in the rstan R package (which is available in Q and Displayr).
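As a rough sketch of that last point: if you export the draws of the population means from another package into a spreadsheet (the file layout below is a hypothetical example, with the chains stacked on top of one another), you can compute n_eff and Rhat with rstan::monitor.

library(rstan)

draws <- read.csv("exported_draws.csv")   # hypothetical export: one column per parameter,
                                          # chain 1's iterations followed by chain 2's
n.iterations <- nrow(draws) / 2           # assuming 2 chains were run
sims <- array(as.matrix(draws),
              dim = c(n.iterations, 2, ncol(draws)),
              dimnames = list(NULL, NULL, colnames(draws)))

# warmup = 0 because the exported draws are assumed to already exclude warm-up.
monitor(sims, warmup = 0)                 # prints means, n_eff and Rhat per parameter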

You only want to be using models that have converged. If one of the models has a worse predictive accuracy this could be a sign that it has not converged.

Local optima

As mentioned, the techniques start with an initial guess. Sometimes this initial guess is so poor that it is impossible to get a good result. You can diagnose this by inspecting predictive accuracy. However, a better approach would be to run the models many times and compare them.

Algorithm

Even if you have two models that sound the same, they will often still lead to different results due to decisions that people have made when creating the algorithms. Examples include:

  • The default number of iterations.
  • How they test for convergence. For example, do they stop when a model can only be improved by 0.0000001% or 0.0000002%?
  • The estimation method. For example, when fitting latent class analysis, Latent Gold uses Bayesian posterior mode estimation whereas Q uses maximum likelihood estimation. When fitting hierarchical Bayes (HB), Q and Displayr use Hamiltonian Monte Carlo, whereas Sawtooth and bayesm use Gibbs Sampling.
  • Randomization. All the modern algorithms include some form of randomization. As they use different random numbers, they will usually get different answers.

Differences in how you implement your algorithms will guarantee small differences between results.

Ambiguous preferences

MaxDiff experiments usually do not collect a lot of data from each respondent. There is usually no way of determining, with certainty, what a respondent's true preferences are for alternatives that were neither loved nor loathed (click here for a demonstration of this problem). This ambiguity of preferences means that you can have two different sets of results and both can be adequate descriptions of the underlying data, much in the same way that people come up with different explanations for election results. Which is correct? The trick is to choose the one with better predictive accuracy.

Conclusion

There are lots of different reasons why different software packages can give different results. However, ultimately the comparison should focus on the empirical side of things rather than the theory:

  • If you have two sets of different results, you should choose between them based on predictive validity. See Using Cross-Validation to measure MaxDiff Performance for more information about this.
  • If the results are very similar, but not identical, this should not be a surprise, due to the reasons listed above. If you get two broadly similar sets of results you can be pretty confident that your results are probably not due to local optima or convergence issues, so that is good news!
Comparing MaxDiff Models and Creating Ensembles in Displayr

Types of MaxDiff model

There are two main categories of MaxDiff model: hierarchical Bayes and latent class. Within these categories, models are further specified by other parameters such as the number of classes. We frequently want to experiment with a variety of different models in order to find the most accurate.

To illustrate the comparison, we are going to use 1 and 3 class hierarchical Bayes models as well as a 1 class latent class model. This post describes how to set up a MaxDiff model in Displayr. I'll also be using the technology data described in that post. For each model we leave out 2 questions during the fitting process. The prediction accuracy for the 2 questions provides an unbiased estimate of accuracy (compared to the accuracy from the questions used for fitting). The output for the 1 class hierarchical Bayes model is below.


Comparing models

To create a table comparing several models, navigate to Insert > More > Marketing > MaxDiff > Ensemble. Then drag models into the Input models box, or select them from the drop-down list.

If you don't tick the Ensemble box, Displayr will create a table that just compares the models. When this is the case, it is not necessary that the models use the same underlying data. If you do check the Ensemble box then Displayr creates an additional model. This requires that the underlying models all use the same data.

The table for my 3 models is as follows. The best values for each measure are shaded in dark blue and the worst are shaded in light blue.


We can see that the 1 class hierarchical Bayes model performs the best in terms of accuracy on the holdout questions. It also has superior BIC and log-likelihood metrics (which are measures of goodness of fit).

How models are combined in an ensemble

To create an ensemble, we use the respondent utilities (also known as coefficients or parameters). I provide a brief overview here but this post describes more about MaxDiff.

  • Utilities are a measure of how much each respondent prefers each alternative.
  • The model fits (i.e., estimates) these utilities from the responses to the questions.
  • The preference of a respondent for an alternative is calculated as e raised to the power of the utility.
  • The probability that the respondent will choose a specific alternative is given by the ratio of the preference for that alternative to the sum of the preferences of all possible alternatives.

The table below shows the utilities for the first 10 respondents. Apple, Google and Samsung tend to have high utilities, so they are the preferred alternatives.


The ensemble is created by averaging utility tables across the models.
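As a minimal sketch of these two steps (averaging, then converting utilities into probabilities), using small placeholder utility matrices rather than real model output:

set.seed(1)
alternatives <- c("Apple", "Google", "Samsung", "Sony", "Nokia")

# Placeholder respondent x alternative utility matrices, standing in for the
# utilities saved from, say, two hierarchical Bayes models and a latent class model.
utilities.hb1 <- matrix(rnorm(25), 5, 5, dimnames = list(NULL, alternatives))
utilities.hb3 <- utilities.hb1 + rnorm(25, sd = 0.3)
utilities.lc  <- utilities.hb1 + rnorm(25, sd = 0.3)

# The ensemble simply averages the utility tables across the models.
ensemble.utilities <- (utilities.hb1 + utilities.hb3 + utilities.lc) / 3

# Preference for an alternative is e raised to the power of its utility; choice
# probabilities divide each preference by the sum across the alternatives.
preferences <- exp(ensemble.utilities)
choice.probabilities <- preferences / rowSums(preferences)
round(choice.probabilities, 3)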

Why ensembles can improve accuracy

We can see from the earlier table that the ensemble has a superior out-of-sample prediction accuracy to each of the 3 underlying models. Since the ensemble is created by averaging, it may be surprising that the ensemble accuracy isn't just the average accuracy.

To understand this effect, imagine you know nothing about tennis (maybe you don't need to imagine!) and ask one person "Who is the best male tennis player in the world?". They reply "Roger Federer". Depending on how much you think that person knows, you will trust their answer to a certain degree. Now you ask the same question of another 99 people. If their answers all generally agree, you can be more confident that Roger really is the best. If you get a mixture of responses including Rafael Nadal and Novak Djokovic then you would not be so sure who will win the next grand slam tournament.

Ensembles work in a similar manner. Each model makes predictions and some models will be better than others at predicting in a specific situation. By taking the average utilities we reduce the noise from individual models (noise is technically known as variance in this situation).

It's also important to consider model correlation. If the models are very similar then the benefit from averaging will be small. In the extreme case of identical models, each additional model brings nothing new and there is no increase in accuracy. If the models are diverse and each is a good predictor in different situations, then the increase in accuracy is large. In practice the models are similar, so the benefit is small but potentially tangible enough that the winners of prediction competitions almost always use ensembles.

Ensemble parameter histograms

By setting Output to Ensemble we can visualize the respondent utility distributions in the same manner as for the underlying models. I've shown this below.
We can also use Insert > More > Marketing > MaxDiff > Save Variable(s) to add the coefficients, preference shares or proportion of correct predictions to the data set.


Read more about market research, or try this analysis yourself! The flipMaxDiff R package, which uses the rstan package, creates the hierarchical Bayes models and ensemble.

What is MaxDiff?

Download our free MaxDiff ebook

Example of a MaxDiff question

A MaxDiff study involves presenting a sample of respondents with a series of questions, in which each question contains a list of alternatives. Respondents choose which alternative they like the most (best) and which the least (worst). The list of alternatives changes from question to question. I've provided an example, below.

Why do people use MaxDiff?

MaxDiff is used to resolve two practical problems with traditional rating scales:

  • Poor discrimination between alternatives, with respondents in surveys often rating multiple alternatives as very important, or 10 on a 10-point scale
  • Yeah-saying biases, which are a type of response bias, whereby some respondents typically give much higher ratings than others

Consider the problem of working out what capabilities people would most like in the President of the United States. Asking people to rate the importance of each of the following characteristics would likely not be very useful. We all want a decent and ethical president. But we also want a president who is healthy. And the President needs to be good in a crisis.

We would end up with a whole lot of people rating the capabilities as 10 out of 10 for importance. Some people may give an average rating of 9, whereas others may give an average rating of 5, just because they differ in terms of how strongly they like to state things. MaxDiff is ideal in these kinds of situations.

Overview of the process when doing a MaxDiff study

There are five stages in a MaxDiff study:

  1. Creating a list of alternatives
  2. Creating an experimental design
  3. Collecting the data
  4. Statistical analysis
  5. Reporting

Outputs from a MaxDiff study

The end-point of a MaxDiff study is usually one or both of the following:

  • A ranking of alternatives in order of preference. For example, if the study is being used for product-concept testing, the goal is to work out the relative appeal of the concepts.
  • An understanding of differences between people in terms of their preferences for the alternatives. For example, a study examining preferences for product attributes may be designed as an input to a segmentation exercise, looking to find segments of people with different preferences.

Want to know more about MaxDiff? Please go to our Beginners Guide to MaxDiff, or head on over to the Displayr blog and check out more important market research topics!

How to use Covariates to Improve your MaxDiff Model

Create your own MaxDiff Design

MaxDiff is a type of best-worst scaling. Respondents are asked to compare all choices in a given set and pick their best and worst (or most and least favorite). For an introduction, check out this great webinar by Tim Bock. In our post, we'll discuss why you may want to include covariates in the first place and how they can be included in Hierarchical Bayes (HB) MaxDiff. Then we'll use the approach to examine the qualities voters look for in a U.S. president.

Why include respondent-specific covariates?

Advances in computing have made it simple to include complex respondent-specific covariates in HB MaxDiff models. There are several reasons why we may want to do this in practice.

  1. A standard model, which assumes each respondent's part-worths are drawn from the same normal distribution, may be too simplistic. Information drawn from additional covariates may improve the estimates of the part-worths. This is particularly likely for surveys with fewer questions and therefore less information per respondent.
  2. Additionally, when respondents are segmented, we may be worried that the estimates for one segment are biased. Another concern is that HB may shrink the segment means overly close to each other. This is especially problematic if sample sizes vary greatly between segments.

How to include covariates in the model

In the usual HB model, we model the part-worths for the ith respondent as βi ~ N(μ, Σ). Note that the mean and covariance parameters μ and Σ do not depend on i and are the same for each respondent in the population. The simplest way to include respondent-specific covariates in the model is to modify μ to be dependent on the respondent's covariates.

We do this by modifying the model for the part-worths to βi ~ N(Θxi, Σ), where xi is a vector of known covariate values for the ith respondent and Θ is a matrix of unknown regression coefficients. Each row of Θ is given a multivariate normal prior. The covariance matrix, Σ, is re-expressed into two parts: a correlation matrix and a vector of scales, and each part receives its own prior distribution.
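As a rough illustration of what xi looks like in practice (a generic R sketch, not the code Displayr runs internally; the variable name and levels are hypothetical, loosely based on the case study below), a categorical covariate is typically expanded into an intercept plus dummy variables:

# A few hypothetical respondents with a single categorical covariate.
respondents <- data.frame(
  vote2016 = factor(c("Clinton", "Trump", "Another candidate", "Didn't vote", "Clinton"))
)

# One row per respondent: an intercept column plus dummies for the
# non-reference levels. Row i of this matrix plays the role of x_i.
X <- model.matrix(~ vote2016, data = respondents)
X

# The part-worths are then beta_i ~ N(Theta %*% X[i, ], Sigma): each group of
# voters gets its own mean vector, while all respondents share the covariance.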

Fitting covariates in Q and Displayr

This model can be fit in Q and Displayr, which use the No-U-Turn sampler from Stan - the state-of-the-art software for fitting Bayesian models. The package allows us to quickly and efficiently estimate our model without having to worry about selecting the tuning parameters that are frequently a major hassle in Bayesian computation and machine learning. The package also provides a number of features for visualizing the results and diagnosing any issues with the model fit.

Download our free MaxDiff ebook

Example in Displayr

The dataset

Our data set comes from a survey that asked 315 Americans ten questions about the attributes they look for in a U.S. president. Each question asked the respondents to pick their most and least important attributes from a set of five. The attributes were:

  • Decent/ethical
  • Plain-speaking
  • Healthy
  • Successful in business
  • Good in a crisis
  • Experienced in government
  • Concerned for the welfare of minorities
  • Understands economics
  • Concerned about global warming
  • Concerned about poverty
  • Has served in the military
  • Multilingual
  • Entertaining
  • Male
  • From a traditional American background
  • Christian

For more information, please see this earlier blog post, which analyzes the same data using HB, but does not consider covariates.

Fitting your MaxDiff Model

In Displayr and Q, we can fit a MaxDiff model by selecting Marketing > MaxDiff > Hierarchical Bayes from the menu (Anything > Advanced Analysis in Displayr and Create in Q).  See this earlier blog post for a description of the HB controls/inputs and a demo using a different data set. Documentation specific to the Displayr GUI is on the Q wiki.

We then included a single categorical predictor in the model - responses to the question of who they voted for in the 2016 election. The predictor had the following levels: voted for Clinton, voted for Trump, voted for another candidate, didn't vote, and don't know or refused to answer.

We would expect this predictor to have a very strong correlation with the best and worst choices for each respondent. To compare the models with and without covariates in Displayr, first fit the model without covariates and then copy/paste the created R item.

To add the covariates, select them from the drop-down box labeled "Covariates" under the MODEL section in the Object Inspector of your copied HB output.

Create your own MaxDiff Design

Checking convergence

We fit the models using 1000 iterations and eight Markov chains. When conducting an HB analysis, it is vital to check that the algorithm used has both converged to and adequately sampled from the posterior distribution. Using the HB diagnostics available in Displayr (see this post for a detailed overview), there appeared to be no issues with convergence for this data. We then assessed the performance of our models by leaving out one or more respondent questions and seeing how well we could predict their choices using the estimated model.

Results

If we only hold out one question for prediction and use the other nine questions to fit the models, the effect of the categorical predictor is small. The model with the categorical predictor takes longer to run for the same number of iterations due to the increased number of parameters, and it delivers only a modest improvement in out-of-sample prediction accuracy (from 67.0% to 67.4%). We did not gain much from including the predictor because we could already draw substantial information from the nine MaxDiff questions.

Including fixed covariates becomes much more advantageous when you have fewer MaxDiff questions - as in the extreme example of only having two questions to fit the models. We see a larger improvement in out-of-sample prediction accuracy (from 54.5% to 55.0%). We also see a much higher effective sample size per second. This means that the algorithm is able to sample much more efficiently with the covariate included. Even more importantly, this saves us time as we don't need to use as many iterations to obtain our desired number of effective samples.

Download our free MaxDiff ebook

Ready to include your own covariates for analysis?

Create your own MaxDiff Design

The Accuracy of Hierarchical Bayes When the Data Contains Segments

A simulation involving two segments

To explore this problem I generated some simulated data for 200 fake respondents. I used a MaxDiff experiment with 10 alternatives (A, B, ..., J) and 2 segments (75% and 25% in size). One segment was created to prefer the alternatives in the order A > B > ... > J, with coefficients of 4.5, 3.5, ..., -4.5. The second segment had the coefficients in the reverse order.
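A rough re-creation of this set-up in R (not the original simulation code) looks something like the following; the best/worst simulation at the end uses one common convention, drawing the worst choice with probabilities proportional to exp(-utility).

set.seed(123)
alternatives <- LETTERS[1:10]                    # A, B, ..., J
segment <- rep(1:2, c(150, 50))                  # 75% / 25% of 200 respondents

coefs.segment1 <- seq(4.5, -4.5, by = -1)        # A = 4.5, ..., J = -4.5
coefs.segment2 <- rev(coefs.segment1)            # reversed preferences

# Respondent-level coefficients: each person takes their segment's values.
beta <- rbind(coefs.segment1, coefs.segment2)[segment, ]
colnames(beta) <- alternatives

# Simulate one question for the first respondent: 5 alternatives shown, best
# drawn proportional to exp(utility), worst from the rest proportional to exp(-utility).
shown <- sample(alternatives, 5)
u     <- beta[1, shown]
best  <- sample(shown, 1, prob = exp(u))
rest  <- setdiff(shown, best)
worst <- sample(rest, 1, prob = exp(-u[rest]))
c(best = best, worst = worst)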

The models

I estimated six models:

  • A multinomial logit model (using Tricked Logit for this and all models, to deal with the worst choices).
  • A latent class logit model with 2 classes (i.e., latent class analysis with multinomial logit estimated in each class).
  • A latent class logit model with 3 classes.
  • A standard HB model.
  • A 2-class HB model. This is a hybrid of latent class analysis and HB, where the model assumes that the data contains two segments, where each segment has its own HB model.
  • A 3-class HB model.

Predictive accuracy of the models

The experimental design contained 6 questions with 5 options in each. I used 4 of the questions, randomly selected for each respondent, to fit the models, using the remaining 2 for cross-validation purposes. The predictive accuracy from the cross-validation is shown below. In most regards these results are as we would expect. The latent class logit with 2 classes outperforms the standard HB model. So does the 2-class HB model. Both three class models perform the same. The one result I did not expect is that the standard HB model does surprisingly well.

Number of classes    Latent Class Analysis    Hierarchical Bayes
        1                    62.3%                  78.5%
        2                    82.0%                  82.0%
        3                    82.0%                  82.0%

Investigating the standard HB model

To better understand the performance of the standard HB model I computed the individual-level coefficients and formed them into two segments using K-means cluster analysis. It perfectly recovered the two segments - that is, each person was classified into the correct segment. This is obviously very good news, as it suggests that from the perspective of forming segments, we can achieve this result with HB, even when latent class analysis is the theoretically better model.

Nevertheless, the table above shows that the standard HB model (first row, last column) has worse predictive accuracy than latent class analysis (second row, second column). As mentioned above, I simulated the data with coefficients of -4.5, -3.5, -2.5, ..., 4.5, and the reverse in the second segment. The chart below shows the estimated averages for each segment from the standard HB model. HB has correctly recovered the relative order of the preferences, but the average coefficients are incorrect. They are, by and large, more extreme. For example, alternative J is estimated as having a coefficient of 7.7 and -7.1, whereas the correct values are -4.5 and 4.5. (By contrast, the values estimated for the latent class analysis, which are not shown, were almost identical to the simulated values, as we would expect given that the data was generated under the assumption of latent classes).

A second problem emerges when we look at the distributions of the individual-level coefficients. Below, to the left, I have shown density plots of the distributions from the latent class analysis for people in the first segment. On the right I show the distributions as estimated from the standard HB for the same respondents. In addition to the means being further from 0, the HB is estimating a lot of variation within the segments, and this is largely incorrect. For example, for alternative A, shown at the bottom, the latent class analysis estimates a value of -4.4 relative to the true value of -4.5, whereas the HB model incorrectly indicates that there is variation within the segment from around -11 to 0, with a median of about -7.

chart showing the difference between individual-level coefficients for LCA and HB models

If the only focus is creating segments, this difference is pretty trivial in this example, but this would not always be the case. If the individual-level coefficients are used for other purposes, such as correlating with other data or computing preference shares, these errors become more important.

Multi-class HB is better

As shown earlier, the 2 and 3-class HB models had the same predictive accuracy as the 2-class Latent Class model. The chart compares the estimated individual-level coefficients with those from the Latent Class Analysis. The 2-class HB model is not quite as good as the Latent Class Analysis model, but the differences are small (particularly compared to the standard HB).

chart showing the difference in individual-level coefficients between lca and multi-class hb models

Implications

The simulation that I describe in this post shows that if the data does contain segments, latent class analysis will do a better job than the one-class HB models that most MaxDiff practitioners use. This supports the conventional wisdom about the different strengths of these models. However, in real-world situations we do not always know whether the data truly contains segments. As a result, the best strategy is generally to compare multiple models based on predictive accuracy. The best model in general seems likely to be an HB model with multiple classes, as this has the flexibility to be useful regardless of whether segmentation exists or not.

See how we conducted this analysis in Displayr, or to find out more, head to the visualization page of our blog!

Creating Pairwise Balanced MaxDiff Designs

Create your own MaxDiff Design

Creating single version designs

These earlier posts describe how to create MaxDiff experimental designs in Displayr, Q and with R. They also give some guidelines on how to set the numbers of questions and alternatives per question, as well as advice on interpreting designs.

The standard method used to create designs aims to maximize the amount of information that can be extracted from responses. This naturally involves showing each alternative approximately the same number of times and showing each pair of alternatives together approximately the same number of times. Experimental design is a relatively complex topic, but fortunately packaged algorithms do the hard work for us.

As an example, I show below a design with 10 alternatives, 5 alternatives per question and 6 questions.


This design has a single version - each respondent is asked the same questions, selecting the best and worst from the same sets of alternatives.

The design can also be described as a binary matrix, where the presence of a 1 indicates which alternatives are shown in each question. The binary matrix for the design above is shown below.


Provided the guidelines for producing a good design are adhered to, this procedure usually makes an excellent single version design. In rare cases of many alternatives, you could find it advantageous to increase the number of repeats.

Creating multiple version designs

A simple procedure to create another version is to randomly swap the columns of the binary design. Below we can see that the first column of the original version has moved to the seventh column of the new version. Whenever alternative 1 appeared in the original version, we now show alternative 7 in the second version of the design.


You can repeat this process for as many versions as are required. It has the advantage that it preserves the distribution of the frequencies of occurrences of the alternatives within each version. This means that, since each alternative appeared 3 times in the original design, each alternative appears 3 times in all other versions. The same preservation is true of the distribution of pairwise occurrences of alternatives within questions for each version.
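A minimal sketch of this column-swapping idea in base R (not the flipMaxDiff implementation; the starting design here is just a random placeholder with the right shape, whereas a real design would also balance the alternative and pair frequencies):

set.seed(42)

# Placeholder single-version design: 6 questions x 10 alternatives, with a 1
# marking the 5 alternatives shown in each question.
design <- t(replicate(6, sample(c(rep(1, 5), rep(0, 5)))))
colnames(design) <- paste0("Alt", 1:10)

# A new version is the same binary matrix with its columns randomly permuted;
# the column position (not the old label) defines which alternative is shown.
make.version <- function(design) {
  permuted <- design[, sample(ncol(design))]
  colnames(permuted) <- colnames(design)
  permuted
}
version2 <- make.version(design)

colSums(design)    # alternative frequencies in the original version...
colSums(version2)  # ...have the same distribution, reassigned to different alternatives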

However, one drawback is that across many versions the distribution of pairwise occurrences may become imbalanced. This is a consequence of randomly permuting the alternatives. From the pairwise occurrences across 100 versions shown below, we can see that alternative 5 occurs with alternative 6 a total of 143 times, but alternatives 2 and 5 co-occur only 124 times.


This imbalance does not usually cause any problems if using Hierarchical Bayes to analyze experimental results. However, it makes the results of counting analysis more difficult to interpret, because alternatives have not been shown with each other the same number of times.

Creating pairwise balanced designs

An alternative strategy for extending a design to multiple versions is to attempt to maintain pairwise balance. With this method, the design is incremented one version at a time. Many randomly permuted candidate versions are considered. The candidate version that creates an overall design with the least imbalance of pairwise frequencies is chosen.

In Displayr and Q this is achieved automatically. In R, the balanced.versions argument for the MaxDiffDesign function in the flipMaxDiff package is set to TRUE.

The resulting pairwise frequencies for 100 versions are shown below. Note that the variation is now only between 133 and 134 -  dramatically less than the variation when the design is not pairwise balanced.


Create your own MaxDiff Design

Assessing the difference between randomly permuted and pairwise balanced designs

In order to highlight the differences between randomly permuted and pairwise balanced designs, it is useful to compare some summary statistics. The math isn't complex, but there are a few steps to each calculation, which I explain below.

Mean version balance. The balance of a version is the sum of the absolute differences between the alternative frequencies and their mean. I have illustrated this via the example below. The table shows the counts of each alternative within the version. The mean count is 12, so the sum of the absolute differences from 12 is 2 + 3 + 1 + 2 + 4 = 12. A perfectly balanced version would score zero. The higher the value, the more imbalanced the design.



To calculate the mean version balance, these balances are averaged across all versions. The resulting mean is standardized so that perfect balance has a score of one and the worst possible design (where each question repeats the same set of alternatives) has a score of zero.

Mean version pairwise balance. This is an analogous calculation to the mean version balance, except for the pairwise counts of how many times alternatives are shown together within each version. It is also averaged across versions and scaled so one is perfection and zero is the worst possible.

Across version balance and Across version pairwise balance. These are calculated in the same way as above, the only difference being that counts are taken across all versions. This provides an assessment of how balanced the total design is, but says nothing about the variation between versions.
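As a sketch of the raw calculations behind these statistics (using the placeholder design matrix from the earlier sketch; the final rescaling onto a zero-to-one scale is left out here):

# Alternative frequencies within one version, and the version balance.
alt.counts      <- colSums(design)
version.balance <- sum(abs(alt.counts - mean(alt.counts)))

# Pairwise co-occurrence counts within the version, and the pairwise balance.
pair.counts      <- t(design) %*% design
pair.counts      <- pair.counts[upper.tri(pair.counts)]   # each pair counted once
pairwise.balance <- sum(abs(pair.counts - mean(pair.counts)))

# The across-version statistics apply the same sums to the counts accumulated
# over every version of the design.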

Results

Calculating the above four values for the 100 version random and pairwise balanced designs shows that mean version balance and across version balance are both 1 for both designs. This is unsurprising since the single version design showed each alternative exactly 3 times. Both measures and both methods of creating multiple versions retain the balance.

Mean version pairwise balance is 0.786 for both methods. Although this may appear disappointingly low, there are 60 pairs shown in each version allocated across 45 possible pairs. Since 60 is not divisible by 45 we can never achieve perfect pairwise balance in a single version. In fact 0.786 is the best possible.

Across version pairwise balance reveals the improvement of the pairwise balance method with a score of 0.998 compared to 0.984 for random versions.

Conclusion

Pairwise balance provides an improved method for producing a multi-version MaxDiff design. The summary statistics introduced above (values on a scale of zero to one) allow easy comparison of designs. Balanced designs can be easier to work with when performing a simple counting analysis of MaxDiff experiments, but are generally unnecessary if using more sophisticated methods such as Hierarchical Bayes.

Try it yourself

The examples in this post use Displayr as a front-end to running the R code. If you go into our example document, you can see the outputs for yourself. The code that has been used to generate each of the outputs is accessible by selecting the output and clicking Properties > R CODE on the right hand side of the screen. Designs are created with our own flipMaxDiff package, within which D-efficient designs use the AlgDesign package.

Create your own MaxDiff Design

Download our free MaxDiff ebook

Checking Convergence When Using Hierarchical Bayes for MaxDiff

Please read Using Hierarchical Bayes for MaxDiff in Q, or Using Hierarchical Bayes for MaxDiff in Displayr, prior to reading this post.

Download our free MaxDiff ebook

Technical overview

Hierarchical Bayes for MaxDiff models individual respondent utilities as parameters (usually denoted beta) with a multivariate normal (prior) distribution. The mean and covariance matrix of this distribution are themselves parameters to be estimated (this is the source of the term hierarchical in the name). Hierarchical Bayes uses a technique called Markov Chain Monte Carlo (MCMC) to estimate the parameters, which involves running a number of iterations where estimates of the parameters are generated at each iteration. This iterative sampling of parameters forms what is known as a Markov Chain. In this post I shall use the term sample to refer to a set of estimates of the parameters from a single iteration.

Different software packages have different approaches. Stan (used via the rstan R package by Q and Displayr) uses a modern form of MCMC called Hamiltonian Monte Carlo. Sawtooth, on the other hand, uses the more old-school Gibbs sampling. My experience is that both approaches get the same answer, but the newer Hamiltonian Monte Carlo is faster for really big problems with lots of parameters. If you wish to use Gibbs Sampling in R, you can do so using the bayesm package. However, my experience is that the resulting models have a worse fit than those from Stan and from Sawtooth.

The samples are generated from those in the previous iteration using a set of rules. The rules are such that once sufficient iterations have been run and the initial samples are discarded, the distribution of the samples matches the posterior distribution of the parameters given prior distributions and observations. In the case of MaxDiff, the observations are the respondent's best and worst choices to the options presented to them. By default, Stan discards the first half of the samples. This is known as the warm-up. The latter half of the samples are used to estimate the means and standard errors of the parameters.

The difficult part is knowing how many iterations is sufficient to reach convergence, which I discuss in the next section.

Create your own MaxDiff Design

Multiple chains

To take advantage of multi-core processors, multiple Markov chains are run on separate cores in parallel. This has the effect of multiplying the sample size, in return for a slightly longer computation time compared to running a single chain. The more samples, the less sampling error in the results. (Here, I am referring to sampling error that results from the algorithm, which is in addition to sampling error from the selection of the respondents). In addition, multiple chains result in samples that are less autocorrelated and less likely to be concentrated around a local optimum. Having said that, hierarchical Bayes is, in general, less susceptible to local optima than traditional optimization methods due to the use of Monte Carlo methods.

To make full use of computational resources, I recommend that the number of chains is chosen to be equal to or a multiple of the number of logical cores of the machine on which the analysis is run. In Q and Displayr the default of 8 chains is ideal. If you are running your model by R code, you can use detectCores() from the parallel R package. It is important to run multiple chains for the diagnostics discussed in the next section.
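If you are running the analysis yourself in R, the number of logical cores can be checked directly (a trivial sketch):

library(parallel)
detectCores(logical = TRUE)   # e.g. 8; choose the number of chains as this value or a multiple of it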


Achieving convergence

Formally, a chain is considered to have converged when the sampler reaches stationarity, which is when all samples (excluding the initial samples) have the same distribution. In practice, heuristics and diagnostics plots are used to ascertain the number of iterations required for convergence. The heuristics are based upon two statistics, n_eff and Rhat, which are shown in the parameter statistics output below:

This table includes statistics for the class sizes, estimated means and standard deviations. The off-diagonal covariance parameters are not shown due to the lack of space, and because they are not as important. n_eff is an estimate of the effective sample size (samples of the parameters, not cases). The smaller n_eff, the greater the uncertainty associated with the corresponding parameter. Thus, in the table above, we can see that all the sigma parameters (the standard deviations) tend to have more uncertainty associated with them than the means (this is typical).

The column se_mean shows the standard error of the parameter means, which is computed as sd/sqrt(n_eff), where sd is the standard deviation of the parameter.

Rhat refers to the potential scale reduction statistic, also known as the Gelman-Rubin statistic. This statistic is (roughly) the ratio of the variance of a parameter when the data is pooled across all of the chains to the within-chain variance. Thus, it measures the extent to which chains are reaching different conclusions. The further the value of the statistic from 1, the worse.

As a strategy to achieve convergence, I suggest starting with 100 iterations and setting the number of chains equal to the number of logical cores available. The four conditions to check for convergence are:

  1. No warning messages should appear. Most warning messages are due to insufficient iterations. However if a warning appears indicating that the maximum tree depth has been exceeded, the maximum tree depth setting should be increased from the default of 10 until the warning goes away.
  2. The estimate of the effective sample size, n_eff, is at least 50 for all values and ideally 100 or more for parameters of interest. A value of 100 is equivalent to specifying that the standard error of the mean is at least an order of magnitude (10 times) less than the standard deviation.
  3. The potential scale reduction statistic, Rhat, should be less than 1.05 and greater than 0.9 for the parameters of interest.
  4. Diagnostics plots should not have any unusual features (discussed below).

If any of these conditions are not met, the analysis should be re-run with double the number of iterations, until all conditions are met. Increasing the iterations beyond this point will increase the precision of the estimates but not drastically change the results. I find that the standard deviation parameters take more iterations to reach convergence than mean parameters. Also, the effective sample size condition tends to be a stricter condition than the one on the potential scale reduction statistic Rhat, so it will be the last to be satisfied.
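If you are fitting the model yourself with rstan (rather than through the Q or Displayr menus), conditions 2 and 3 can be checked programmatically along the following lines, where fit is assumed to be an existing stanfit object:

library(rstan)

stats <- summary(fit)$summary          # one row per parameter
n.eff <- stats[, "n_eff"]
rhat  <- stats[, "Rhat"]

converged <- all(n.eff >= 100, na.rm = TRUE) &&
             all(rhat > 0.9 & rhat < 1.05, na.rm = TRUE)

if (!converged)
  message("Not converged - re-run with double the number of iterations.")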

Refer to the Stan website, and in particular the Stan Modeling Language: User's Guide and Reference Manual for more information.


Trace plots

Trace plots show the parameter samples for each chain. The plots below are for means and standard deviations from an example with 500 iterations and 2 chains. The grey halves indicate the warm-up iterations, whereas the second halves of each plot contain the samples that are used to compute the final result. The following are features to look out for that would indicate an issue with the sampling:

  • A high degree of correlation between parameters. This would manifest as two or more traces moving in sync.
  • A parameter has not stabilized by the end of the warm-up iterations.
  • For a particular parameter, there is a chain which is consistently higher or lower than the others.

A practical challenge with these plots on modern machines with more than 2 chains is that often it can be hard to see patterns because all the lines overlap so much.

The above example shows traces which look to converge fairly well. Consider the traces which come from reducing the number of iterations drastically to 100:

Here we see that 100 iterations is not sufficient. Many of the lines have not stabilized after the warm-up stage.


Posterior Intervals

Posterior interval plots show the range of the sampled parameters. The black dot corresponds to the median, while the red line represents the 80% interval, and the thin black line is the 95% interval. There is nothing out of the ordinary in this plot, but I would be concerned if intervals were wide or if the median was off-center.

Summary

I have provided a brief summary of how hierarchical Bayes for MaxDiff works, explained some key diagnostic outputs, and outlined a strategy to ensure convergence. The challenge with hierarchical Bayes is that unlike methods such as latent class analysis, the user needs to verify that the number of iterations is sufficient before proceeding to make any inferences from the results. The convergence criteria listed in this blog post provides a way to ensure the iterations are sufficient, and give us confidence in the answers we obtain. I will show in future blog posts that hierarchical Bayes performs much better than latent class analysis on out-of-sample individual predictions. To see the R code used to run hierarchical Bayes and generate the outputs in this blog post, see this Displayr document.

Create your own MaxDiff Design


Acknowledgements

This feature was made possible by the R package rstan, as well as the extremely helpful answers to my questions by Stijn Masschelein and Bob Carpenter on the Stan community forum, and the post https://www.r-bloggers.com/stancon-2018-highlights/. 

 

Using Hierarchical Bayes for MaxDiff in Displayr

Try your own MaxDiff Hierarchical Bayes

Getting started

Your MaxDiff data needs to be in the same format as the technology companies dataset used in previous blog posts on MaxDiff such as this one. To start a new Hierarchical Bayes analysis, click Insert > More > Marketing > MaxDiff > Hierarchical Bayes. Many options in the object inspector on the right are identical to those of latent class analysis and I shall not explain them here. Separate sections below describe the remaining options specific to Hierarchical Bayes.

Number of Classes

This parameter controls the complexity of the model. If the data set contains discrete segments of people, these segments may be missed if the number of classes is set to 1. A more complex model, which is one with more classes, is more flexible, but takes longer to fit and may not necessarily provide better performance. If investigating more than one class, it is advisable to ensure it has better predictive accuracy than the one class solution (via cross-validation, discussed below).

If comparing results with Sawtooth, set this to 1: the number used in all Sawtooth models.

Try your own MaxDiff Hierarchical Bayes

Iterations

This option controls how long the analysis runs for. More iterations result in a longer computation time but often lead to better results. When using fewer iterations, the possibility exists that the model returns premature results. In addition, warning messages may appear about divergent transitions or that the Bayesian Fraction of Missing Information is low. However, the absence of warning messages does not mean that the number of iterations is sufficient. I discuss this in more detail below.

Chains

This option specifies how many separate chains (independent analyses) to run, where the chains run in parallel, given the availability of multiple cores. Increasing the number of chains increases the quality of the results. It does, however, result in a longer running time if chains are queued up. Therefore I suggest leaving the option at its default value of eight. This makes full use of the 8 cores available when running R code in Displayr.

Maximum tree depth

This is a very technical option, and I am not even going to try and explain it. The practical implication is that this option should only need to be changed if a warning appears indicating that the maximum tree depth has been exceeded. The default value is 10 and a warning should not appear under most circumstances. If such a warning does appear, I suggest first increasing the maximum tree depth to 12 rather than a larger number, as this could increase computation time significantly.

Download our free MaxDiff ebook

Output

The output of Hierarchical Bayes is a table with histograms showing the distribution of respondent coefficients. In the next few sections I describe other types of outputs created from the Insert > More > Marketing > MaxDiff > Diagnostic menu. The coefficients are scaled to have a mean of 0 across brands (i.e., the brand with the highest mean, in this case, Apple, is the most preferred brand). We can also see that some people like it less than average (which is shown by the proportion of the histogram in red). Apple is the most divisive brand, resulting in a wide range of coefficients. In contrast, a majority of respondents have a mildly positive view of Google and not many dislike it.

In addition to the distribution and mean value of the parameters, take note of the Prediction accuracy shown at the top of the output. This shows the percentage of questions in which the model correctly predicts the choice of most preferred. A number of different factors determine predictive accuracy:

  • The consistency within the data. If people have given near-random data, then prediction accuracy will always be poor.
  • The amount of data per person. The more data each person has provided, the higher the prediction accuracy, all else being equal.
  • The number of iterations. If you have too few iterations, you will get poor predictive accuracy. You check this by checking convergence, further discussed below.
  • Whether the predictions are in the sample or from cross-validation (as specified with the Questions left out for cross-validation control). Cross-validation predictive accuracy is typically considerably lower than in-sample accuracy.

Trace plots

The plots below can be created by selecting a Hierarchical Bayes output and clicking on Insert > More > Marketing > MaxDiff  > Diagnostic > Trace Plots. The plots show how the main parameters change as the analysis progresses over the iterations for each chain. The grey half-section indicates the warm-up iterations, which are excluded from the final results. The non-grey part is the final result. With more than a couple of cores, these plots can be hard to read. Beyond the warm-up stage, the range of each chain should overlap almost entirely with those of the other chains. If one chain appears higher or lower than the others, this would indicate a problem with the model and more iterations may be required.

Posterior Intervals

You create this plot by selecting a Hierarchical Bayes output and clicking on Insert > More > Marketing > MaxDiff > Diagnostic > Posterior Intervals. This plot shows the range of the main parameters in the analysis. The black dot corresponds to the median, the red line represents the 80% interval of the draws, and the thin black line is the 95% interval. The plot includes the distributions of the estimated means and standard deviations, which gives us an understanding of the uncertainty around the conclusions drawn from the earlier histograms. For example, comparing Apple and Google, we can see that there is little overlap in the distributions of the means, which tells us that the data shows a clearly greater average preference for Apple than for Google. Looking at the standard deviations, we can see that the value for Apple is, by far, the largest of any of the brands. This tells us that it is a divisive brand (i.e., some people love it, others like it much less), as we observed earlier.

Parameter statistics

You can create this table by selecting a Hierarchical Bayes output and clicking on Insert > More > Marketing > MaxDiff > Diagnostic > Parameter Statistics. For each of the main parameters, the table shows:

  • the mean,
  • the standard error of the mean,
  • the standard deviation,
  • the effective sample size (n_eff), and
  • the potential scale reduction statistic (Rhat).

What we want to see are values of 100 or more in the n_eff column and, ideally, 1.1 or less in the Rhat column. Consequently, we can see problems with convergence for most of the standard deviation parameters in the table below. We need to re-run the model with more iterations until these problems are fixed. I typically keep doubling the number of iterations until the problems go away. More detail about checking for convergence is in Checking Convergence When Using Hierarchical Bayes for MaxDiff.
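If you are fitting a comparable model directly in R with rstan, the same checks can be scripted. The sketch below assumes stan.fit is a hypothetical stanfit object; within Displayr, the Parameter Statistics table above already reports these quantities.

library(rstan)
stats = summary(stan.fit)$summary       # one row per parameter, with n_eff and Rhat columns
any(stats[, "n_eff"] < 100)             # TRUE indicates too few effective draws
any(stats[, "Rhat"] > 1.1)              # TRUE indicates the chains have not converged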

Try your own MaxDiff Hierarchical Bayes

Using Hierarchical Bayes results in other analyses

Typically, once convergence has been established, the preferences for each respondent need to be extracted from the model and used in further analysis. The most straightforward approach to doing this is to select the model output (i.e., which shows the histograms and predictive accuracy), and select Insert > More > Marketing > MaxDiff > Save Variable(s) > Individual-Level Coefficients. This creates:

  • One new variable for each of the alternatives. This is the raw data that is plotted as histograms in the standard output.
  • A summary table showing the averages.

Individual-level coefficients can be difficult to interpret, as the scale is in logit-space. The most straightforward solution is to instead compute preference shares, which show, for each person, their probability of choosing each of the alternatives first:  Insert > More > Marketing > MaxDiff > Save Variable(s) > Compute Preference Shares.
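To make the link between the two scales concrete, the sketch below applies a simple logit-to-share transformation to a hypothetical respondents-by-alternatives matrix of individual-level coefficients called coefs. It is illustrative only; in practice, use the built-in option above.

# Each respondent's coefficients are exponentiated and rescaled to sum to 100%.
shares = 100 * sweep(exp(coefs), 1, rowSums(exp(coefs)), "/")
round(colMeans(shares), 1)   # average preference share for each alternative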

For more advanced analyses, such as preference simulations, it can be useful to extract the posterior draws. These are individual-level coefficients, where there are multiple coefficients for each person, and they reflect the uncertainty of the estimated preferences. To obtain them, select the output that contains your model, go to Properties > R CODE, and add the optional parameter hb.beta.draws.to.keep = 100 to the call to FitMaxDiff (this parameter sets the number of samples to keep per respondent; I have chosen 100 in this case). Then, to access the posterior draws, create a new R Output (Insert > R Output) and enter the following code, replacing max.diff with the name of the model output (which may well be max.diff, if you have only created one MaxDiff analysis):

max.diff$beta.draws[, 1, ]

This returns a matrix of the draws for the first respondent, with rows corresponding to iterations and columns to variables. The iterations dimension combines the iterations from the multiple chains, with the warm-up iterations excluded. To get the draws for another respondent, simply replace the 1 in the code with that respondent's index.
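As an illustration of how the draws can be used, the sketch below converts each retained draw for the first respondent into preference shares and averages them, so that the resulting shares reflect the uncertainty in that respondent's coefficients. This is only a sketch; the object name max.diff follows the example above.

draws = max.diff$beta.draws[, 1, ]                      # iterations x alternatives
shares = 100 * sweep(exp(draws), 1, rowSums(exp(draws)), "/")
round(colMeans(shares), 1)                              # shares averaged over the draws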

]]>
https://www.displayr.com/hb-maxdiff-displayr/feed/ 0
A Short Course for Learning Displayr https://www.displayr.com/short-course/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/short-course/#respond Mon, 04 Dec 2017 07:38:39 +0000 https://www.displayr.com/?p=4114 ...]]> This post is written as a DIY training course in learning Displayr. It links off to a series of other posts, which have been written to be read in order, each one building on concepts from the previous post. Each post in the series is accompanied by a video.


Case Study 1: Getting started with a simple visualization

The first post to read is called Case Study: Creating a Simple Dashboard. It shows you how to:

  • Start a Displayr document.
  • Create a really cool visualization using data in an Excel spreadsheet.
  • Set this up as a dashboard that can be shared with others.

 


Case Study 2: The analysis of raw data from a survey

This post, Case Study: Analyzing Survey Data, shows how to:

  • Quickly create a simple report from a raw data file. The data file comes from a simple shopping survey, but the basic principles are applicable to any type of raw data file (e.g., customer transaction data).
  • Create tables and charts.
  • Set up filters, both for your own use and for the end-users.

 


Case Study 3: The automatic updating of time series

The post Case Study: Automatically Updating Interactive Time Series Dashboard introduces you to:

  • Pulling in data using external APIs. The two that are illustrated are Google Trends and Yahoo Finance.
  • Having data automatically update and the dashboard automatically re-publish.
  • Cool interactive charts.

 


Case Study 4: Getting started with advanced analysis

The post Case Study: Advanced Analysis in Displayr (MaxDiff) illustrates the basic workflow of performing advanced analyses in Displayr. Do not be too concerned if you are not familiar with MaxDiff, as the workflow that is illustrated is used with many other advanced analysis approaches as well. However, if you are really sure you are not interested, you can skip it.
 


Case Study 5: Visualizing data

Case Study: Visualizations illustrates many of the different ways of creating visualizations in Displayr, by creating a report of the data from Case Study: Advanced Analysis in Displayr (MaxDiff). You do not need to have read the earlier post in order to understand this post.
 


Case Study 6: Updating analyses and reporting with revised data

Case Study: Updating Reporting illustrates how to update data files so that the entire report, including all analyses, automatically updates.
 


Next steps

Having worked your way through the case studies, the next place to move on to is the overview of all the different features in Displayr: Introduction to Displayr 1: Overview.

]]>
https://www.displayr.com/short-course/feed/ 0
Case Study: Advanced Analysis of Experimental Data (MaxDiff) https://www.displayr.com/advanced-analysis-maxdiff/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/advanced-analysis-maxdiff/#respond Mon, 04 Dec 2017 07:37:21 +0000 https://www.displayr.com/?p=4093 ...]]> This is the fourth in a series of blog posts designed as a DIY training course for using Displayr. This post presents the analysis of a relatively exotic experiment. Do not be concerned if you are unfamiliar with the technique (MaxDiff) and its data, as that is not the point of the post. This post is designed to introduce the basic workflow by which more advanced analyses are conducted in Displayr, and the basic principles illustrated in this post are applicable to most other types of modeling (e.g., machine learning, latent class analysis, regression).

Download our free MaxDiff ebook

If you are really sure that advanced analysis is not your thing, skip ahead to Case Study: Visualizations.

You can view a video of this dashboard being created here.

Step 1: Opening the document

Step 2: Fitting the MaxDiff model

  • Open a new empty model from the menu:
    • In the Ribbon: Insert > More (Analysis) > Marketing > MaxDiff > Hierarchical Bayes

  • Feed it some data and customize the model, in the Object Inspector on the Inputs tab:
    • Design location: Provide a URL
    • Design URL: http://docs.displayr.com/images/8/88/President_Experimental_Design.csv (MaxDiff is an experimental method, and its analysis requires both raw data and the experimental design).
    • Version: MaxDiff Version [MDversion]
    • Best selections: Type mdmost and select the 10 variables. Make sure you select them in the correct order.
    • Worst selections: Select the 10 mdleast variables.
    • Add Alternative labels: Add these alternatives in the spreadsheet that opens: Decent/ethical, Plain-speaking, Healthy, Successful in business, Good in a crisis, Experienced in government, Concerned for the welfare of minorities, Understands economics, Concerned about global warming, Concerned about poverty, Has served in the military, Multilingual, Entertaining, Male, From a traditional American background, Christian
  • This calculation is going to take about 10 minutes or so (it is doing a lot!). In the meantime, you may want to skip ahead and start on the next post in this series: Case Study: Visualizations.

When the calculation has finished, you will see an output like the one below. If you are interested in understanding what it means, we have lots of posts about MaxDiff on our blog. However, there is no need to understand this output at this juncture, as the purpose of the post is to introduce the general workflow of using Displayr for advanced analyses.

Step 3: Extracting additional outputs from the model

Often when conducting an advanced analysis it is useful to extract certain things from the model, such as predictions, goodness-of-fit plots, and the like. The basic workflow for doing this in Displayr is to select the model output of interest, and then choose specific things to extract from menus.

Click on the output from the model and in the Object Inspector, select the diagnostics or other things to extract under Inputs > DIAGNOSTICS or Inputs > SAVE VARIABLE(S). Note that there are various things to extract in both menus. Click on Inputs > SAVE VARIABLE(S) > Save Preference Shares.

Once the calculation is complete, a new variable set will appear at the top of the Data Tree (that is, under Data Sets in the bottom left): Preference shares from max.diff.

Step 4: Writing up the results of the model

This post just illustrated the creation of the key outputs: the preference share variables. The next post in this series, Case Study: Visualizations, creates visualizations to explain the key outputs from this study.

Try it yourself

To see the document created by following the steps in this post, click here. Or...

Create your own MaxDiff Design

]]>
https://www.displayr.com/advanced-analysis-maxdiff/feed/ 0
Comparing Tricked Logit and Rank-Ordered Logit with Ties for MaxDiff https://www.displayr.com/tricked-vs-rank-ordered-logit/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/tricked-vs-rank-ordered-logit/#respond Tue, 17 Oct 2017 11:27:23 +0000 https://www.displayr.com/?p=3553 ...]]> Create your own MaxDiff Design

Tricked logit

Multinomial logit is used to model data where respondents have selected one out of multiple alternatives. The logit probability of selecting y given the utilities \beta is

    \[ \textrm{P}(y|\beta)=\frac{\textrm{exp}(\beta_y)}{\sum_{i\in A}\textrm{exp}(\beta_i)},\qquad(1) \]

where A denotes the set of alternatives.

In MaxDiff, respondents select two alternatives instead: their favourite (best) and least favourite (worst). Tricked logit models MaxDiff data by treating the best and worst selections as independent. The probability of the best selection is given by (1) whereas the probability for the worst selection is obtained by negating the utilities in (1):

    \[ \textrm{P}(y_{\textrm{best}},y_{\textrm{worst}}|\beta)=\textrm{P}(y_{\textrm{best}}|\beta)\textrm{P}(y_{\textrm{worst}}|\beta)=\left[\frac{\textrm{exp}(\beta_{y_{\textrm{best}}})}{\sum_{i\in A}\textrm{exp}(\beta_i)}\right]\left[\frac{\textrm{exp}(-\beta_{y_{\textrm{worst}}})}{\sum_{i\in A}\textrm{exp}(-\beta_i)}\right].\qquad(2) \]

This implies that the alternative that is most likely to be chosen as the best is least likely to be chosen as the worst, and vice versa. However, this assumption of independence between the best and worst selections is unreasonable, as it does not rule out impossible scenarios where the same alternative is selected as both the best and the worst.

So, why do people use tricked logit? It can be conducted by "tricking" existing software for multinomial logit into modeling the best and worst selections. This is done by duplicating the design matrix for the worst selections, except with the indicator value 1 replaced with -1. This approach is equivalent to (2). Another benefit is speed: with tricked logit the number of cases is only double that of multinomial logit, whereas the more correct model, rank-ordered logit with ties, is much more computationally intensive, as I will show below.
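To make equation (2) concrete, here is a minimal sketch that evaluates the tricked logit probability for a single question directly, rather than via the design-matrix trick (the two are equivalent). The utility values are purely illustrative.

tricked.logit.prob = function(beta, best, worst) {
    p.best = exp(beta[best]) / sum(exp(beta))      # probability of the best selection, as in (1)
    p.worst = exp(-beta[worst]) / sum(exp(-beta))  # probability of the worst selection, with utilities negated
    unname(p.best * p.worst)
}

beta = c(Apple = 1.2, Google = 0.5, Samsung = -0.3, Sony = -1.4)
tricked.logit.prob(beta, best = "Apple", worst = "Sony")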

Rank-ordered logit with ties

Rank-ordered logit with ties is applied to situations where respondents are asked to rank alternatives from best to worst, with the possibility of ties. MaxDiff data can be analysed using rank-ordered logit with ties since selecting the best and worst alternatives is the same as ranking the best alternative first, the worst alternative last, and the other alternatives tied in second place. To compute the probability of selecting a particular pair of best and worst alternatives, the probabilities of every possible way in which the respondent could have ranked the alternatives that would have led to the observed best and worst alternatives are summed together. It can be shown (I shall not attempt to do so here) that the probability is given by

    \[ \textrm{P}(y_{\textrm{best}},y_{\textrm{worst}}|\beta)=\left[\frac{\textrm{exp}(\beta_{y_{\textrm{best}}})}{\sum_{i\in A}\textrm{exp}(\beta_i)}\right]\left[\sum_{\phi\in\Phi}(-1)^{|\phi|}\frac{\textrm{exp}(\beta_{y_{\textrm{worst}}})}{\textrm{exp}(\beta_{y_{\textrm{worst}}})+\sum_{i\in\phi}\textrm{exp}(\beta_{i})}\right] \]

where \Phi is the set of combinations of alternatives other than the best and worst and |\phi| denotes the number of alternatives in \phi. As the number of combinations in \Phi is 2^{|A|-2}, the cost of computing the probability increases exponentially with the number of alternatives.

To learn more about this model, read Allison and Christakis (1994), Logit Models for Sets of Ranked Items, Sociological Methodology, Vol. 24, 199-228.
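The sketch below evaluates this probability for a single question by summing over the 2^{|A|-2} subsets, as in the formula above. It is illustrative only (and deliberately naive, to show why the computation grows quickly with the number of alternatives); it is not the implementation used in Displayr.

ro.logit.ties.prob = function(beta, best, worst) {
    others = setdiff(names(beta), c(best, worst))
    p.best = exp(beta[best]) / sum(exp(beta))   # probability the best alternative is ranked first
    subset.sum = 1   # the empty subset contributes exp(beta_worst) / exp(beta_worst) = 1
    if (length(others) > 0)
        for (k in seq_along(others))
            for (phi in combn(others, k, simplify = FALSE))
                subset.sum = subset.sum + (-1) ^ k *
                    exp(beta[worst]) / (exp(beta[worst]) + sum(exp(beta[phi])))
    unname(p.best * subset.sum)
}

beta = c(Apple = 1.2, Google = 0.5, Samsung = -0.3, Sony = -1.4)
ro.logit.ties.prob(beta, best = "Apple", worst = "Sony")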

Prediction accuracies for Latent Class Analysis

I have compared both methods using the out-of-sample prediction accuracies for individual respondents with the technology companies dataset used in previous blog posts on MaxDiff. One of the simplest models to test on is latent class analysis. The line chart below plots prediction accuracies for both methods for a latent class analysis with five classes over random seeds 1 to 100. The models were fitted using four of the six questions given to each respondent, and prediction accuracies were obtained from the remaining two questions. The seeds were used to generate the random selection of in-sample and out-of-sample questions, so that different sets of questions are used with each seed. The line chart suggests that the two methods have a similar level of prediction accuracy. The mean difference between them is 0.14%, while the standard error of this mean is 0.16%. As the standard error is larger than the mean, the difference is not significantly different from zero, and so I consider the two methods tied in this comparison.


Prediction accuracies for Hierarchical Bayes

Hierarchical Bayes is a much more flexible model than latent class analysis, and it provides superior out-of-sample predictive performance. This can be seen by comparing the previous chart with the line chart below for Hierarchical Bayes: the prediction accuracy is around 10 percentage points higher. The correlation between the accuracies from the two methods is also higher, at 92% compared to 60%. Again, the two methods have a similar level of prediction accuracy. The mean difference between them is 0.06%, while the standard error of this mean is also 0.06%, which implies the difference is not significantly different from zero.


Estimated parameters

The results so far have indicated that the two methods give similar results. The radar charts below show the mean (absolute value) and standard deviation parameter values for each alternative, averaged over 20 random seeds. Tricked logit has slightly larger mean parameters and slightly larger standard deviations. I think this is due to the different way in which the parameters are converted to probabilities. For the same set of parameters, I have found that rank-ordered logit tends to yield higher probabilities, so, to compensate, its parameters tend to be closer to zero. I wouldn't read too much into this difference, as it probably does not impact results.



Summary

While rank-ordered logit is better supported by theory than tricked logit, I have found that the two perform the same when it comes to predicting individual responses for MaxDiff. As tricked logit is less computationally intensive, I would recommend using it when analysing MaxDiff data. Note that MaxDiff is only a trivial problem for rank-ordered logit, which is able to be applied in more general situations where respondents are asked to rank alternatives with the possibility of ties, and shortcut solutions such as tricked logit do not exist. To run tricked logit and rank-ordered logit in Displayr on the data used in this blog post, or on your own data set, click here. 

]]>
https://www.displayr.com/tricked-vs-rank-ordered-logit/feed/ 0
11 Tips for your own MaxDiff Analysis https://www.displayr.com/maxdiff-analysis-11-tips-to-diy/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/maxdiff-analysis-11-tips-to-diy/#respond Thu, 03 Aug 2017 01:22:31 +0000 https://www.displayr.com/?p=2663

If you are a MaxDiff analysis novice, please check out A Beginner's Guide to MaxDiff analysis before reading this post.

1. Keep it simple (particularly if it is your first MaxDiff analysis)

MaxDiff analysis projects come in all shapes and sizes. They vary on the following dimensions:

  • Sample size. I have seen them from 40 respondents all the way up to more than 6,000.
  • The number of separate MaxDiff experiments in the study. Typically people have only a single experiment, but I have seen as many as five in the one project (one for each market segment, with different attributes in each).
  • How many alternatives there are (e.g., 10 in a small study, 100 in a huge study).
  • The number of versions of the experimental design, from 1 through to 100's.
  • How many times each alternative is shown to each person. At least 3 is good practice. With less than this number, you need to rely heavily on statistical wizardry.
  • The number of separate analysis objectives (e.g., segmentation, cross-category segmentation, profiling).
  • The speed with which the analysis needs to be done, from hours through to months.

If it is your first study, make sure you "go safe" (simple) on each of these dimensions. That is, if possible, have a small sample size, a single experiment, a small number of alternatives, only one version, show each alternative to each person 3 times, have a single and clear analysis objective (e.g., segmentation), and set aside a lot of time (think weeks if it is your very first study).

If you find yourself in a situation where you need to "go large" on your first project, there are a couple of simple hacks. The obvious one is to outsource. The alternatives are to have a go at recreating and understanding old studies you have outsourced, or to work through case studies or MaxDiff examples, such as the MaxDiff analysis examples here.

2. Avoid having multiple versions unless you really, really, need them

Having multiple versions of an experimental design causes two problems. First, it is not ideal for segmentation, as the point of segmentation is to group together people that are similar, and having a design with multiple versions reduces your ability to work out if people are similar. Second, it is just an additional complication that can cause errors in fieldwork or analysis. So, unless you have a good reason to have multiple versions, such as too many alternatives for a single version, it is best not to have multiple versions in my opinion.

3. Estimate your models before you are out of field

Do you wear a seat belt when you drive? The MaxDiff (and choice modeling) equivalent of a seat belt is to run all your models based on the initial interviews, ideally halting interviewing until your initial analysis is complete. I have done this on every single consulting project I have ever performed. If it is your first MaxDiff project, and you don't do this, you are choosing to learn to drive without a seat belt.

Estimating your models before having completed field work achieves three goals. It:

  • Allows identification of problems in the experimental design prior to it being too late to do anything to fix them. See also How to Check an Experimental Design (MaxDiff, Choice Modeling).
  • Leads to a fast identification of field work problems when there is time to fix them.
  • Provides forewarning of likely analysis problems. Some of the more common analysis problems are described below. All of them take time to fix. If you discover them while in the field, they can usually be solved before it becomes a client-facing car crash.

4. Start with latent class analysis

There are five basic ways to analyze MaxDiff experiments:

  • Counting analysis (aka counts analysis).
  • Latent class analysis.
  • Standard Hierarchical Bayes.
  • Mixtures of Normals with a Full Covariance Matrix, either estimated using Maximum Simulated Likelihood (in Q) or Hierarchical Bayes (via the R package 'bayesm').
  • More complex models, such as varying coefficients, constrained mixtures of normals, nonnormal distributions, generalized mixed logit, etc.

The first of these, counting analysis, is invalid. You are better off using a traditional rating scale than using counting analysis with MaxDiff.

Latent class analysis is the safest of all of these methods.

Why is latent class analysis the safest?

Latent class analysis has a few big advantages over the more complex methods:

  • It is easy to interpret. All the other advanced methods require either technical expertise or post-processing of respondent-level data in order to interpret them correctly. With latent class analysis, you just look at the segment sizes and descriptions, which makes it simple to understand. Simple to understand means you quickly find problems and/or insights.
  • It is the best default method for segmentation. Latent class analysis creates segments. This is precisely what it is designed to do. Yes, there are other methods that can also create segments (see Tip 11). However, they are two-step methods (first compute respondent-level results, then cluster them), and errors are introduced in each step. On the other hand, latent class analysis involves only a single step and thus, all else being equal, involves less error.
  • It is a safe model. The standard Hierarchical Bayes model, which is available in Sawtooth, is usually, in a statistical sense, a bit better than latent class analysis. But, it can sometimes be much worse (particularly for segmentation). In particular, if there are a small number of discrete segments, a latent class analysis will likely find them but the standard Hierarchical Bayes model will likely not.

See How MaxDiff Analysis Works (Simplish, but Not for Dummies) for an intro into MaxDiff analysis.

5. Increase the number of segments if you get 'Other' or correlated segments

Latent class analysis can lead to uninteresting segments. There are two common flavors of uninteresting segments that arise in latent class analysis:

  • Correlated segments. Segments that have the same top alternative(s), differing only in relativities among the less preferred alternatives. If you are looking at preference shares (aka probability %), you can even have segments that appear to be identical, because the differences all relate to preferences for the Worst option, but they are all rounded to 0% so cannot be seen.
  • An 'Other' segment, where everything is somewhat important and few alternatives are unimportant.

The 6-segment latent class solution below illustrates both of these types of poor segments. This segmentation looks at preferences for tech companies. Segments 1 and 3 both show strong preferences for Samsung. Segments 2, 4, and 6 strongly prefer Apple. Two of the segments (5, and to an extent 4) have highly mixed preferences.


Correlated segments

Compare segments 2 and 6. People in these segments have a very strong preference for Apple. As Apple is the most preferred brand in the study, it makes sense that we would have Apple devotees split into multiple segments, in terms of their second preferences. In segment 2, the hardware brands Samsung and Sony are the second and third most preferred brands. In segment 6, by contrast, the secondary preferences are for the software brands, Google and Microsoft. These segments make sense. It is just that from a managerial perspective they are perhaps not very interesting, and latent class analysis is only focused on the statistical properties of the data rather than the managerial significance.

Another cause of correlated segments is where people have the same basic preferences, but they differ in the amount of noise (i.e., inconsistencies) in their data.

Yet another cause of correlated segments is when a dominant attribute has been included. For example, if you have an attribute like "taste" in a study of food or drinks or "quality" in a technology market, there is a good chance it will be important in all of your segments.

Other segments

In the case of the 'Other' segment, there are two reasons why it can occur: because lots of very small segments have been forced together, or because there is a segment of people who answer questions in a highly inconsistent fashion.

A solution

A solution to this problem is often just to increase the number of segments, rather than using judgment to merge together similar segments. This does not always solve the problem. Tip 11 provides a different solution which usually does the job.

6. Switch to a continuous mixture model if you get 'Other' or correlated segments

Latent class analysis assumes that there are a number of segments in the data. This assumption can be wrong, and when it is, it can manifest itself as uninteresting segments (see Tip 5). Most of the more complicated models instead assume that people vary on continua. The simplest of these models is standard Hierarchical Bayes, but see also the third, fourth, and fifth models described in Tip 4.

7. Compare multiple models

While you should start with latent class analysis, and it will often do the job if your focus is segmentation, it is usually a good idea to compare different models. This is doubly true if you have an interesting model. Three basic ways of choosing between models are:

  • Statistical comparisons. The best approach is usually to compare based on cross-validation, but most latent class analyses do not support this, so you can use the BIC instead.
  • Based on the extent to which their respondent-level preference shares are correlated with demographics. I discuss respondent-level preference shares in Tip 11.
  • Strategic usefulness. If you have two different models, and there is no way to separate them on statistical grounds, it is reasonable to choose the model which gives the most interesting strategic conclusions.

8. Use a powerful computer

These days most computers are OK for most problems. Large MaxDiff studies, however, are not a normal problem. Some big analyses can take days to run a single model. A faster computer, with more memory and a faster CPU, can do things in one-tenth the time of a slow old clunky one. If you have a large sample size (e.g., more than 2,000), a large number of versions, or a large number of alternatives, you will really appreciate using a powerful computer. If you fail to heed this advice, the best case is that you spend a lot of time waiting. The worst case is that your computer crashes and you cannot get any results without getting a faster computer.

9. Run multiple models at the same time

Usually, there is a need to run multiple models, at the very least comparing latent class analysis to a continuous mixture model. If the models are really slow, the simple solution is to run several models at the same time. If you are using a cloud-based app, like Displayr, you can get it to run the models at the same time for you. But the simple hack, if you are using a desktop program like Sawtooth or Q, is to open the program multiple times, perhaps even on multiple computers.

10. Choose fast algorithms for big problems

There are lots of different MaxDiff algorithms. Some are faster than others. This makes a big difference if you have a big MaxDiff analysis. To the best of my knowledge, the fastest safe approaches are, in order:

  • Latent class analysis with a small number of segments (e.g., 5 or less).
  • Hierarchical Bayes.
  • Normal mixing distribution with a full covariance matrix estimated via the EM algorithm. You can implement this exotic algorithm in Q by setting up the MaxDiff analysis as a Ranking question, using Create > Latent Class Analysis and, under Advanced, setting the distribution to Multivariate Normal - Full Covariance, unchecking the Pooling option, and estimating only a single class/segment. This model is theoretically very similar to Hierarchical Bayes, but in practice seems a bit worse than Sawtooth's Hierarchical Bayes.
  • A mixture of normal mixing distributions. This is a combination of the normal mixing distribution and latent class analysis (i.e., it is latent class analysis, where each class has its own multivariate normal mixing distribution).

11. Use cluster analysis if latent class analysis does not give you good segments

If you want to do segmentation, latent class analysis should always be your first port of call, as described in Tip 4. However, if you have time, or you get uninteresting segments (see Tip 5), a good alternative is to:

  1. Estimate a continuous mixture model (see Tip 4).
  2. Compute respondent-level preference share estimates. See the section Respondent-level preference shares in How MaxDiff Analysis Works (Simplish, but Not for Dummies) for more information about computing them.
  3. Form segments using cluster analysis.
  4. Use the normal tricks to make the cluster analysis good. For example, if you find segments dominated by a preference for a particular brand, you can leave out the variable that relates to this brand, merge together segments, scale the variables, conduct cluster analysis within a segment, etc.

Check out more tips on Market Research, or run your own MaxDiff analysis in Displayr!

This analysis was done in Displayr. To see Displayr in action, grab a demo.

 

]]>
https://www.displayr.com/maxdiff-analysis-11-tips-to-diy/feed/ 0
How to Check an Experimental Design (MaxDiff, Choice Modeling) https://www.displayr.com/check-experimental-design/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/check-experimental-design/#respond Mon, 24 Jul 2017 18:38:18 +0000 https://www.displayr.com/?p=2535 ...]]> In this post, I explain the basic process that I tend to follow when doing a rough-and-ready check of an experimental design. The last step, Checking with a small sample, is the gold-standard. I've never heard a good excuse for not doing this.

Every now and then somebody sends me an experimental design and says, "Can you please check it for me, I need to know if it is OK, it is really urgent!". Usually, they also send an accompanying spreadsheet containing big tables of numbers. I understand why people want somebody to check their designs. It is a scary thing to conduct an experiment. Small mistakes can ruin the whole thing!

Most of my experiments have involved marketing, economics, social research, and surveys. In particular, I spent a couple of decades doing choice modeling, conjoint, and MaxDiff experiments. So if you are looking for information to help you plan a clinical trial, crop experiment, or running a factory, this post is going to be a waste of your time.

Download our free MaxDiff ebook

Checking where the experimental design came from 

Just like with food and drugs, checking the provenance of an experimental design saves a lot of time. If somebody sends me a design created by software that I or one of my colleagues has written, then it is pretty quick for me to check it. If the design used SAS, JMP or Sawtooth, then again it is pretty easy. In these situations, all I really need to check is that the user has clicked the right buttons. But, where the design is of unknown provenance, life gets a lot more difficult. There is the possibility that it is completely wrong.

Checking experimental designs with heuristics

The most straightforward way to check a design is to use rules of thumb (i.e., heuristics). Different heuristics have been developed for pretty much every type of design, and you should definitely try and find ones applicable to whatever problem you are trying to solve (e.g., experimental designs for MaxDiff, drug trials, or for process modeling). However, the following heuristics pop up in many different fields. Please tell me if you know of some good ones that I have missed. But, before reading these please keep in mind that they are all heuristics. There are times when you can achieve better designs by ignoring some of them.

Frequency

Does each experimental manipulation occur enough times to conduct a sensible statistical analysis? For example:

  • In a completely randomized design with a single factor, where each subject is randomly allocated to a treatment (e.g., a new drug, existing drug, or placebo), it is possible to run power calculations to work out the appropriate minimum sample size, provided something is known about the likely efficacy of the different treatments (a minimal example is sketched after this list). In fields where there is insufficient knowledge to work out the likely efficacy, different rules of thumb exist. For example, in a marketing experiment, non-academics are rarely impressed by sample sizes of less than 100. In sensory research, where the experimental manipulations are stronger, sample sizes of 40 can be considered OK.
  • With a MaxDiff experiment, a common rule of thumb is that each alternative needs to be shown to each person at least three times if there is a need to estimate each person's preferences (e.g., if conducting segmentation).
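A minimal sketch of such a power calculation in base R, assuming purely illustrative response rates of 40% for the placebo and 55% for the new treatment:

power.prop.test(p1 = 0.40, p2 = 0.55, sig.level = 0.05, power = 0.80)
# Reports n, the minimum sample size required per group to detect this difference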

Balance

Does each manipulation occur (roughly) the same number of times? For example, in a completely randomized experimental design, is the sample size of each group the same? In a MaxDiff experiment, does each alternative get shown to each respondent the same number of times? In choice modeling, does each level of an attribute get shown the same number of times?
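Frequency and balance checks are easy to script. The rough sketch below uses the technology MaxDiff design from elsewhere on this blog and assumes the layout used throughout these posts: a version column, a task column, and then the indices of the alternatives shown in each task (adjust for your own design).

design = read.csv("http://wiki.q-researchsoftware.com/images/7/78/Technology_MaxDiff_Design.csv")
shown = as.matrix(design[, -(1:2)])             # alternatives shown in each task
table(shown)                                    # frequency: how often each alternative appears
n.alternatives = max(shown)
pair.counts = matrix(0, n.alternatives, n.alternatives)
for (task in seq_len(nrow(shown)))
    for (pair in combn(shown[task, ], 2, simplify = FALSE))
        pair.counts[pair[1], pair[2]] = pair.counts[pair[1], pair[2]] + 1
pair.counts + t(pair.counts)                    # roughly equal off-diagonal counts indicate pairwise balance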

Minimal correlation

In experiments where multiple manipulations occur (i.e., multi-factor experiments), it is usually a good idea to check that the manipulations are uncorrelated. For example, in an experiment manipulating, say, colors and prices, it is usually desirable to check that certain prices are not more or less likely to appear with specific colors. That is, usually, it is desirable to have no correlation between the factors. The term orthogonal means that the variables have a correlation of 0. (Note that outside of experimental design, correlation and orthogonality have different meanings.)

If you have studied some experimental design at college, you may have come across the idea that there should be no correlation of any kind between experimental factors. This is far from always true. For example:

  • In choice modeling studies in many markets, it is desirable to have correlations between variables. For example, in the car market, you would generally want the experimental design to have correlations between brand and price. Otherwise, you will end up collecting silly data. There is no point showing people a Lamborghini for $10,000 or a Corolla for $1,000,000.
  • In studies with prohibitions (e.g., certain combinations of drugs that should never be administered together), negative correlations are guaranteed.
  • In studies where there are constraints regarding the number of manipulations shown at any time, there will be negative correlations (e.g., MaxDiff designs)

Checking the randomization mechanism

Experimental designs typically need to involve some form of randomization. For example, if you are allocating people to receive a drug or to receive a placebo, it is important to allocate them randomly. If you were to give the sicker-looking patients the placebo, this would likely exaggerate the efficacy of a drug.

In my experience, the single major cause of errors in experiments relates to people failing to randomize properly. "Properly" means using statistical software with a random number generator (e.g., R). To illustrate the types of things people get wrong, here are a few examples of mistakes that caused me a lot of pain:

  • In a choice modeling study, I had about 1,000 respondents and 100 different choice sets (treatments). I wanted to have each person see 10 choice sets, with random allocation of people to treatments. Each choice set was supposed to be seen exactly 100 times. Unfortunately, the company that collected the data discovered that their software was only able to randomize 10 times. So, they came up with the ingenious solution of randomly allocating people to different treatments based on the time (e.g., in the first 36 seconds showing the first choice set, then the second in the next 36 seconds). Needless to say, this did not work out. Unfortunately, it took eight weeks of back-and-forth with the company before they owned up to the error. The study had to be redone. Everybody involved lost a lot of money. One guy's health broke down and he left the industry due to the stress.
  • In another study where I had 200 different treatments and wanted 25 people per treatment, the company collecting the data randomly assigned each person to one of the 200 treatments. Unfortunately, the way that randomization works means that one of the treatments was seen only 8 times, and another 38, with the rest in-between. More data had to be collected until each was seen 25 times, which cost much more money.
  • In an AB test looking at the effectiveness of alternative wordings of different emails, everybody who was considered unimportant was assigned the "standard" wording, and some people considered to be important were given the "personalized" wording. This made it impossible to disentangle whether the wording or importance of the customer was driving response rates.

How can you check if the randomization is working? If I don't know the person or company well, I will usually ask to see the code they are using or ask a lot of questions. Whether or not I do know them, I will generally always get them to do an initial pilot or soft send of the experiment, and check the frequency, balance, and correlations (see the earlier section on heuristics).

Efficiency

Whenever we conduct an experiment, we estimate parameters: for example, the coefficient of a choice model, or the proportion of people who prefer an alternative in a taste test. If we square the standard error of a parameter estimate, we have its variance. To calculate the efficiency of a parameter estimate, you divide 1 by this variance. The D-efficiency is a measure of the efficiency of all the parameters in a specific model.

One way to check any experimental design is to compute its D-efficiency. In practice, this is useful if you are using optimization software to generate an experimental design, or if you want to compare the statistical properties of two or more designs. This method is still inferior to checking on a small sample, which I discuss next.
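The sketch below ties these definitions together in R. Here fit stands for a hypothetical fitted model with a vcov() method, and the exact scaling convention for D-efficiency differs between packages, so treat the numbers as indicative rather than definitive.

V = vcov(fit)                 # covariance matrix of the parameter estimates
1 / diag(V)                   # efficiency of each individual parameter (1 / variance)
det(V) ^ (-1 / ncol(V))       # a D-efficiency-style summary across all parameters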

Checking on a small sample

On any study that I work on, I always do the following:

  1. Get a data file after about 10% of the data has been collected. Either this will be 10% of the final sample, or just a pilot study.
  2. When doing something that I have not done many times before, get the field work to stop at this point.
  3. Review the basic heuristics that are applicable to check that the randomization is working (see Checking the randomization mechanism).
  4. Estimate the models that I need to estimate, taking care to look at the standard errors of all the parameters.
  5. Form preliminary conclusions. That is, check that the model is telling me what I need to know for the project to be a success. Sure the standard errors will be relatively high, but key conclusions should still be making sense at this time.
  6. If everything makes sense continue with the data collection.

You can conduct this process along with all of the other approaches. Or, if you are brave, you can just do this step and skip the earlier approaches. But, skipping testing on a small sample is foolhardy, as it checks things much more thoroughly than the other approaches.

This approach is the only one that checks for clerical errors. That is, it's possible you have a great design, but due to clerical errors, it is not administered correctly. It also allows you to recover if you have made a mistake in the entire conception of the experiment. For instance, sometimes choice modeling studies inadvertently include a couple of factors (attributes) that are so important that everything else becomes insignificant. Where the hypotheses of interest relate to the insignificant factors, this is a big problem. It is best to identify this kind of problem before you have finished the fieldwork. Otherwise, it cannot be fixed.

One last little comment. As a pretty general rule, people get more diligent at checking over time, as they learn from pain. If this is your first experiment, make sure you do everything listed in this post.

Download our free MaxDiff ebook

]]>
https://www.displayr.com/check-experimental-design/feed/ 0
Using Cross-Validation to Measure MaxDiff Performance https://www.displayr.com/best-method-analyzing-maxdiff/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/best-method-analyzing-maxdiff/#respond Mon, 22 May 2017 22:52:08 +0000 https://www.displayr.com/?p=1836 ...]]> This post compares various approaches to analyzing MaxDiff data using a method known as cross-validation. Before you read this post, make sure you first read How MaxDiff analysis works, which describes many of the approaches mentioned in this post.

Download our free MaxDiff ebook


 

Cross-validation

Cross-validation refers to the general practice of fitting a statistical model to part of a data set (in-sample data), and then evaluating the performance of the model using the remaining data (out-of-sample data). This is done by using the model to make predictions with the out-of-sample predictors and comparing the predictions to the out-of-sample outcomes. Cross-validation is popular because it can be uniformly applied to almost any predictive model without needing to deal with the theoretical details of the model. For this comparison, I will partition the data by randomly leaving out 1 of the 6 questions per respondent (out-of-sample), so that the remaining 5 questions will be part of the in-sample data. Performance will be measured by the proportion of respondents for which the best alternative was correctly predicted in the left-out question, using the individual-level parameters estimated using the in-sample data to make predictions.
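The partitioning itself is straightforward to sketch. The respondent and question counts below are illustrative; the model fitting and prediction are then done by the methods discussed in the following sections.

set.seed(123)
n.respondents = 300
n.questions = 6
held.out = sample.int(n.questions, n.respondents, replace = TRUE)  # 1 question held out per respondent
in.sample = matrix(TRUE, n.respondents, n.questions)               # TRUE = used to fit the model
in.sample[cbind(seq_len(n.respondents), held.out)] = FALSE
# Out-of-sample accuracy is then the proportion of respondents whose held-out best
# choice matches the alternative with the highest estimated utility.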

 


 

Latent class analysis

The first model that I shall look at is latent class analysis, or more specifically latent class rank-ordered logit with ties. The table below shows the results for a 3-class latent class analysis on the technology data set seen in previous blog posts, in which respondents are asked to choose between technology companies. The in-sample prediction accuracy in this case is 60.8%, which is the percentage of questions in which the best alternative was correctly predicted. Note that the predictions are being made on the same data that was used to fit the model, which means that the prediction accuracy will be over-optimistic compared to out-of-sample prediction accuracies.


To test this statement, I run the same model but with one question per respondent randomly left out. The out-of-sample prediction accuracy is 56.3%, which as expected is smaller than the in-sample prediction accuracy seen previously. This value is the percentage of respondents in which the best alternative was correctly predicted in the question that was left out.



 

Boosted varying coefficients

The next model is known as boosted varying coefficients, which is a model that we have developed to make use of covariate variables that are often present in the data set. This model works by first running latent class analyses over the levels in the covariates, instead of over respondents, i.e., it assigns class membership probabilities to levels of covariates, instead of to respondents. The number of classes to use is chosen by iterating through all possible class sizes (up to the number of levels), and selecting the model with the best Bayesian information criterion (BIC). From the best model, individual-level parameters are computed for each respondent. If the one-class model turns out to be the best in terms of BIC, no model is selected. This process is repeated for each covariate, where subsequent models are boosted by the individual-level parameters from the previous model. By boosting, I mean that individual-level parameters are added to the usual model parameters when running the model. The resulting individual-level parameters from this process are finally used to boost an individual-level latent class analysis (over respondents).

The table below shows the results from a boosted varying coefficients model where the covariates are the likelihood of recommending certain companies (Apple, Microsoft, Google and Samsung). From the subtitle, we can see that 3 classes were chosen for Apple, Microsoft was excluded and 2 classes were chosen for Google and Samsung. A 3-class latent class analysis is run at the end, which is the same number of classes that was used in the previous section. The in-sample prediction accuracy is 62.3%, which is higher than that from latent class analysis by itself. However, this is somewhat expected, since the boosted varying coefficients model has a larger number of parameters. To account for this, we compare the BIC, which is lower: 8696 versus 8873 in the latent class analysis. Based on the BIC, boosted varying coefficients is a better model.


The table below shows the results from a boosted varying coefficients with one question left out. As before, the prediction accuracy drops to 58.3%, but this is still better than the 56.3% seen with the latent class analysis. This further confirms that boosted varying coefficients is superior to latent class analysis.


 


 

Increasing the number of classes

The choice of 3 classes was arbitrary. With cross-validation, I can determine if another number of classes results in a better model. The table below shows the results of a model which is the same as before but with 3 classes replaced with 5. Prediction accuracy has gone up and BIC has gone down, from which I conclude that 5 classes is better than 3. This process can be repeated until the optimum number of classes is found.


 


 

Increasing the number of questions left out

So far we have only shown results where one question has been left out. If more questions are left out, we reduce the amount of data available to fit the model, which negatively affects performance, as shown in the table below. However, with 2 questions left out, I expect the variability in the prediction accuracy to go down, as we are measuring performance over twice as many data points (although the number of questions per respondent in the in-sample data set shrinks from 5 to 4). There is no exact answer to how many questions to leave out, but I would recommend not leaving out half or more of the questions, so that there is sufficient in-sample data to fit the model.


 


 

Summary

We have seen that cross-validation is a simple but powerful method of comparing MaxDiff model performance, with the results matching those given by the BIC (i.e., when prediction accuracy increases, the BIC goes down). The results above indicate that when everything else is kept the same, the boosted varying coefficients model outperforms latent class analysis.

If you click here, you can login to Displayr and see all the analyses that were used in this post.

]]>
https://www.displayr.com/best-method-analyzing-maxdiff/feed/ 0
How to Analyze MaxDiff Data in Displayr https://www.displayr.com/analyze-maxdiff-data-displayr/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/analyze-maxdiff-data-displayr/#respond Mon, 22 May 2017 20:47:38 +0000 https://www.displayr.com/?p=1805 ...]]> This post discusses a number of options that are available in Displayr for analyzing data from MaxDiff experiments. For a more detailed explanation of how to analyze MaxDiff, and what the outputs mean, you should read the post How MaxDiff analysis works.

The post will cover counts analysis first, before moving on to bringing in the experimental design and hooking it up to the more powerful technique of latent class analysis. Finally, we look at some examples of how your MaxDiff analysis can be customized and taken even further with R code.

The data set that we used for the examples in this post can be found here, and the URL that can be pasted in to obtain the experimental design is:

http://wiki.q-researchsoftware.com/images/7/78/Technology_MaxDiff_Design.csv 

Download our free MaxDiff ebook


 

Counting analysis

To view counts of the best and worst selections, you can:

  1. Select Insert > More > Tables > Raw Data.
  2. Select the variables for the best, or most-preferred, selections in Variables. In this example, the variables are named Q5a: Most, Q5b: Most, etc.
  3. Click Properties > GENERAL, and change the Label to: best
  4. Repeat steps 1 to 3 for the worst variables, which in this example are named Q5a: Least, Q5b: Least, etc.
  5. Click Properties > GENERAL, and change the Label to: worst
  6. For each of the three snippets of code below, select Insert > R Output, and paste in the code.

This code will count up the number of times each alternative was selected as best:

b = unlist(best) # Turning the 6 variables into 1 variable
t = table(b) # Creating a table
s = sort(t, decreasing = TRUE) # Sorting from highest to lowest
# Converting to a one-column table, with the column named Best
best.score = cbind(Best = s) 

This code will count up the number of best and worst selections for each of the alternatives in your MaxDiff experiment:

b = table(unlist(best))
best.and.worst = cbind(Best = b, 
                       Worst = table(unlist(worst)))[order(b, decreasing = TRUE),]

To compute the difference between the best and worst counts, first create the output above for counting the best and worst, and then use this code:

diff = best.and.worst[, 1] - best.and.worst[, 2]
cbind("Best - Worst" = diff)


The experimental design

In order to analyse the MaxDiff responses with more advanced techniques (like latent class analysis, discussed below), the survey data must be combined with the experimental design. For more on how to set up the design, see How to create a MaxDiff experimental design in Displayr.

If you have created your MaxDiff design in Displayr previously, using Insert > More > Marketing > MaxDiff > Experimental design, then you don't need to do anything special here, and you can skip to the next section.  If your experimental design has been created elsewhere, then there are two ways you can bring it in to Displayr:

  1. Paste in the design from a spreadsheet. To do so, select Home > Paste Table (Data), then click the Add Data button on the right, paste your design into the Excel-like grid (as below), and click OK.
  2. Upload your design file to the web, and host it with a publicly-available URL.

For investigating the results from this post, it is easiest to use the URL option, and paste in the following URL:

http://wiki.q-researchsoftware.com/images/7/78/Technology_MaxDiff_Design.csv

In both cases, your design needs to have a layout like the table shown below. The first two columns denote the version number and task number for each task, and the remaining columns indicate which alternatives are shown in each task.

Pasted max-diff design

 


 

Latent class analysis

Displayr can analyze MaxDiff using latent class analysis. For a more detailed explanation of what this means, and how to interpret the outputs, see this post.

To set up the analysis you should:

  1. Import your survey data file by selecting Home > Data Set (Data) and then following the prompts.
  2. Add the latent class analysis to your project by selecting Insert > More > Marketing > MaxDiff > Latent Class Analysis.
  3. Select your experimental design. Change Design location to Provide a URL and then paste in the URL found at the beginning of this post.
  4. In Best selections, choose the variables in your data set which identify the options that were selected as best, or most preferred, in each task. The order of the variables you have selected should match the order from the design (i.e., the variable for the first task should be selected first, the variable from the second task should be selected next, and so on).
  5. In Worst selections, choose the variables in your data set which identify the options that were selected as worst, or least preferred, in each task.
  6. Click on Add alternative labels, and enter the alternative names in the first column of the spreadsheet editor. The order of the alternatives should match the order in the design. For example, in the technology study discussed here, the first alternative in the design is Apple, the second is Google, the third is Samsung, and so on.
  7. Choose the Number of classes. See this post for guidance on how to do so.

Changing the Output option to Classes and recalculating the output will allow you to see the shares for each class.

 


 

Saving the classes

The classes that are identified in the latent class analysis can be used to profile other questions from your data set. To save a new variable which assigns each respondent to a class, select your latent class analysis output, and then use Insert > More > Marketing > MaxDiff > Save Variable(s) > Membership. This will create a new nominal variable that you can use in crosstabs with other questions.

Note that in latent class analysis, each respondent has a probability of being assigned to each class, but when class membership is saved using this option, each respondent is assigned to the class with the highest probability. As a result, there will be some difference between the class sizes reported in the latent class analysis output and the number of respondents that are ultimately assigned to each class. To view the class membership probabilities, use Insert > More > Marketing > MaxDiff > Save Variables > Class Membership Probabilities.
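As a small illustration of the difference, the sketch below assumes a hypothetical respondents-by-classes matrix of membership probabilities called probs (in Displayr, this is what the Class Membership Probabilities option saves).

assigned = apply(probs, 1, which.max)    # hard assignment to the most likely class
table(assigned) / nrow(probs)            # class sizes after hard assignment
colMeans(probs)                          # class sizes as reported by the latent class analysis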

 


 

Respondent-level preference shares

To create variables which contain estimated preference shares for each respondent, based on the latent class analysis, select Insert > More > Marketing > MaxDiff > Save Variable(s) > Compute Preference Shares. When shown in a SUMMARY table, the output will show an average of the preference share for each alternative. To see the shares assigned to each respondent, change the selection in the Home > Columns menu to RAW DATA.

 


 

Charting preference shares

Any table showing preference shares can be used to create a visualization. One nice example is the donut plot shown below. With a little code, you can also extract the Total column from the latent class analysis (or any of the columns, for that matter), so that you can get them into your plot.

Having set up the latent class analysis you can:

  1. Select Insert > R Output.
  2. Paste the code below into the R CODE section.
  3. Click on your latent class output, select Properties > GENERAL, and copy the Name if it is different to latent.class.analysis.
  4. Go back to your new output, and paste in the name in place of latent.class.analysis.
  5. Change the number at the end of the first line to refer to the number of the column you want to plot.
  6. Click Calculate.

The code for extracting the column of preference shares is:

pref.shares = latent.class.analysis$class.preference.shares[, 6]
pref.shares = sort(pref.shares, decreasing = TRUE) * 100

Here, latent.class.analysis$class.preference.shares[, 6] refers to the 6th column of the latent class analysis output table. For the output with 5 classes, this is the Total column. If you want to chart the shares for a particular class instead, change the 6 to the number of that class.

To make the donut chart:

  1. Select Insert > Visualization > Donut Chart.
  2. Choose the pref.shares table in the Table menu.
  3. Click Calculate.


 


 

Preference simulation

Here, preference simulation refers to the process of removing some alternatives from the calculated shares and then rebasing the remaining shares to see how they adjust in the absence of the removed alternatives. While the calculation is slightly more complicated than the ones we have looked at so far in this post, it is still straightforward:

  1. Examine your latent class output and work out the row numbers for the alternatives.
  2. Select Insert > R Output, and paste the snippet of code below into the R CODE section.
  3. Modify the first line of your output if it is not called latent.class.analysis (remember, to find the correct name to use, select your latent class analysis and look at Properties > GENERAL > Name).
  4. Modify line 5 to remove the rows that you want to exclude. In the example snippet, the brands of interest were in rows 1 and 6. The minus sign indicates that those rows should be excluded from the new table of shares. For example, if in your analysis you want to remove rows 2, 3, and 9, you would change x[c(-1, -6), ] to x[c(-2, -3, -9), ] in the code.
input.analysis = latent.class.analysis
# Remove the Total column (column 6 for the 5-class output)
x = input.analysis$class.preference.shares[, -6] * 100
# Remove Apple and Samsung (rows 1 and 6)
x = x[c(-1, -6), ]
# Re-base so each class's shares again sum to 100%
x = prop.table(x, 2) * 100
# Add back a Total column, weighting each class by its size
sizes = input.analysis$class.sizes
x = cbind(x, Total = rowSums(sweep(x, 2, sizes, "*")))
# Add a Total row
new.preferences = rbind(x, Total = colSums(x))


Summary

Displayr includes a powerful latent class analysis tool that can be used to analyze data from a MaxDiff experiment by combining the survey responses with the information from the experimental design. By adding a little R code, you can customize your analysis even further.

Click here for an interactive tutorial on MaxDiff

How MaxDiff Analysis Works (Simplish, but Not for Dummies) https://www.displayr.com/how-maxdiff-analysis-works/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-maxdiff-analysis-works/#respond Mon, 22 May 2017 20:42:26 +0000 https://www.displayr.com/?p=1719 ...]]>

Counting the best scores (super-simple, super risky)

The simplest way to analyze MaxDiff data is to count up how many people selected each alternative as being most preferred. The table below shows the scores. Apple is best. Google is second best.


This ignores our data on which alternative is worst. We should at least look at that. It shows us something interesting. While Apple is clearly the most popular, it has its fair share of detractors. So, just focusing on its best scores does not tell the true story.

The next table shows the differences (the best counts minus the worst counts). It now shows Apple and Google as almost tied in preference. But we know, from just looking at the best scores, that this is not correct!

What is going on here? First, Apple is the most popular brand. This last table is just misleading. Second, and less obviously, the reason that the last table tells us a different story is that Apple is a divisive brand. It has lots of adherents and a fair number of detractors. This means that we need to be focused on measuring preferences at the respondent level, and grouping similar respondents (i.e., segmentation). As we will soon see, there is a third problem lurking in this simplistic analysis, and we will only find it by turning up the heat on our stats.

 

Looking at best and worst scores by respondent

The table below shows the MaxDiff experimental design used when collecting the data. The choices of the first respondent in the data set are shown by color. Blue shows which alternative was chosen as best. Red for worst. The question that we are trying to answer is, what is the respondent's rank ordering of preference between the 10 tech brands?

 

 

The simplest solution is to count up the number of times each option is chosen, giving a score of 1 for each time it is chosen as best and -1 for each time it is chosen as worst. This leads to the following scores, and rank ordering, of the brands:

Microsoft 3 > Google 1 = Samsung 1 = Dell 1 > Apple 0 = Intel 0 = Sony 0 > Yahoo -1 > Nokia -2 > IBM -3
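As a minimal sketch of this scoring, the R snippet below uses hypothetical best and worst selections chosen to reproduce the scores above (it is not the actual respondent data):

# Hypothetical best and worst choices across the respondent's six questions
best = c("Microsoft", "Google", "Samsung", "Microsoft", "Dell", "Microsoft")
worst = c("IBM", "Nokia", "IBM", "Yahoo", "Nokia", "IBM")
brands = c("Apple", "Google", "Samsung", "Sony", "Microsoft",
           "Intel", "Dell", "Nokia", "IBM", "Yahoo")
# Score each brand: +1 each time it is chosen as best, -1 each time as worst
scores = sapply(brands, function(b) sum(best == b) - sum(worst == b))
sort(scores, decreasing = TRUE)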

This approach is very simple, and far from scientific. Look at Yahoo. Yes, it was chosen as worst once, and our counting analysis suggests it is the third worst brand, less appealing to the respondent than each of Apple, Intel, and Sony. However, look more carefully at Question 5. Yahoo was compared with Microsoft, Google, Samsung, and Dell. These are the brands the respondent chose as best elsewhere in the experiment, so the data suggests they are all better than Apple, Intel, and Sony. That is, there is no evidence that Yahoo is actually worse than Apple, Intel, and Sony. The counting analysis is simple but wrong.

A more rigorous analysis

We make the analysis more rigorous by taking into account which alternatives were compared with which others. This makes a difference because not all combinations of alternatives can be tested, as doing so would lead to enormous respondent fatigue. We have already concluded that Yahoo is no different from Apple, Intel, and Sony, which leads to:

Microsoft > Google = Samsung = Dell > Apple = Intel = Sony = Yahoo > Nokia > IBM

Which brand is the second most preferred? Each of Samsung, Google, and Dell has been chosen as best once. Does this mean they are all tied for second? No, it does not. In Question 4, Dell was up against Google, and Google was preferred. Thus, we know that:

Microsoft > Google > Dell > Apple = Intel = Sony = Yahoo > Nokia > IBM

But, note that I have removed Samsung. Samsung is a problem. It may be between Microsoft and Google. It may be between Google and Dell. Or, it may be less than Dell. There is no way we can tell! We can guess that it has the same appeal as Dell. I have drawn Samsung in blue, as while the guess is not silly, it is, nevertheless, a not-super-educated guess:

Microsoft > Google > Samsung = Dell > Apple = Intel = Sony = Yahoo > Nokia > IBM

A more difficult problem is posed by respondent 13's data. She chose Apple twice as best, Samsung twice, and Google and IBM once each. Which is her favorite?  Here it gets really ugly. The data shows that:

  • Apple > Google (Question 1)
  • Apple > IBM (Question 1)
  • IBM > Google (Question 4)
  • Google > Samsung (Question 5)
  • Samsung > Apple (Question 6)
  • Samsung > IBM (Question 6)

This data is contradictory. Look at the first three points. They tell us that Apple > IBM > Google. But the last three tell us that Google > Samsung > Apple = IBM.

 

 

Most people's instinct when confronted by data like this is to say that the data is bad and to chuck it away. Unfortunately, it is not so simple. It turns out most of us give inconsistent data in surveys. We get distracted and bored, taking less care than we should. We change our minds as we think. The interesting thing about MaxDiff is not that it leads to inconsistent data; it is that it allows us to see that the data is inconsistent. This is actually a good thing. If we had instead asked the respondent to rank the alternatives, the data would still have contained errors, but we would never have seen them, as a ranking gives no opportunity for inconsistencies to reveal themselves.

To summarize:

  • Computing scores for each respondent by summing up the best scores and subtracting the worst scores is not valid.
  • We do not have enough data to get a complete ordering of the alternatives.
  • Respondents provide inconsistent data.

Fortunately, a bit of statistical wizardry can help us with these problems.

The magic – latent class analysis

The problem of respondents providing inconsistent data is not new. It has been an active area of academic research since the 1930s. The area of research that deals with it is known as random utility models, and if you are reading this post you may already be familiar with this class of models (e.g., multinomial logit, latent class logit, and random parameters logit all solve this problem).

The second part of the problem, which is that we have incomplete data, is solved by borrowing data from other respondents. Surprisingly to me, even when there is sufficient data to compute preferences for each respondent separately, it is usually still better to estimate preferences by combining their data with that of similar respondents. I think this is because when we analyze each respondent's data in isolation, we overfit, failing to spot that what seemed like preferences were really noise.

These two problems are jointly solved using latent class analysis. The special variant that I illustrate below is latent class rank-ordered logit with ties, an exotic model developed specifically for analyzing the best and worst choices collected in MaxDiff experiments. There are other latent class models that can be used. I am not going to explain the maths. Instead, I will just explain how to read the outputs.

Latent class analysis is like cluster analysis. You put in a whole lot of data, and tell it how many classes (i.e., clusters) you want. The table below shows the results for five classes (i.e., segments). The results for each class are shown in the columns. The size of the class is shown at the top. Beneath is the Probability %, also known as a preference share (i.e., the estimated probability that a person in the segment will prefer an alternative out of all the alternatives in the study).
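In logit-based models such as this one, the Probability % for a class is derived from the class's estimated utilities: each alternative's share is proportional to the exponential of its utility. A minimal sketch, using invented utilities:

# Hypothetical utilities for one class (one value per alternative)
utilities = c(Apple = 1.2, Google = 0.8, Samsung = 1.0, Sony = 0.1, Nokia = -0.9)
# Logit preference shares: exponentiate and rescale so the shares sum to 100%
round(exp(utilities) / sum(exp(utilities)) * 100, 1)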

Class 1 consists of people that have, on average, the preference ordering of Samsung > Google > Microsoft > Sony > ... . It is 21.4% of the sample. Class 2 consists of people with a strong preference for Apple. Class 3 consists of people that like both Apple and Samsung. People that prefer Sony and Nokia appear in Class 4, but have no super-strong preferences for any brand. Class 5 also prefers Apple, followed by Microsoft.

 

If you look at the Total column you will see something that may surprise you. Google's share is only 12.8%, lower than Samsung's. This contradicts the conclusions from the earlier counting analyses, which showed Google as the second most popular brand based on the number of times it was chosen as best, and neck-and-neck with Apple once the worst scores were factored in. How is it that the latent class analysis gives us such a different conclusion? The reason is that the earlier counting analysis is fundamentally flawed.

Looking again at the latent class results, we can see that Google has a moderate share in all of the segments. In this experiment, each person completed six questions. The number of times they chose each of the brands as best across those questions is shown below. The experimental design was constructed so that each alternative was shown to a respondent only three times. If you look at the 3 times column in the table below, it shows that 36% of people chose Apple as best all 3 times, 20% chose Samsung 3 times, and 12% chose Google 3 times. So, we can conclude that Apple is around three times as likely as Google to be the most preferred brand. Now look at the Once and Twice columns. Google is the most likely brand to be chosen once, and it is also the most likely brand to be chosen twice. So, Google is the most popular fallback brand. This highlights why the crude counting analyses can be so misleading: people are asked to make six choices, but the experimental design only shows them their most preferred brand three times, so the counting analysis over-inflates the performance of second- and third-preferred brands.
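A small sketch of this tabulation, using invented counts of how many times each respondent chose Apple as best across their six questions:

# Hypothetical counts: times each of five respondents chose Apple as best
apple.best = c(3, 0, 2, 3, 1)
# Percentage of respondents choosing Apple best 0, 1, 2, or 3 times
round(prop.table(table(factor(apple.best, levels = 0:3))) * 100)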


In the five-class solution above, only Apple clearly dominates any segment. This is not an insight. Rather, it is a consequence of the number of classes that were selected. If we select more classes, we will get more segments containing sharper differences in preference. The table below shows 10 classes. We could easily add more. How many more? There are a few things to trade off:

  1. How well our model fits the data. One measure of this is the BIC, which is shown at the bottom of the latent class tables. All else being equal, the lower the BIC, the better the model. On this criterion, the 10-class model is superior. However, all else is rarely equal, so treat the BIC as a rough guide that is only sometimes useful (a small illustration of comparing BICs follows this list).
  2. The stability of the total column. If you compare the 10- and 5-class solutions, you can see that they are highly correlated. However, it is the 10-class solution that gives the more accurate estimate (for the more technical readers: as the model is non-linear, the total column, which is a weighted sum of the other columns, is invalid when the number of classes is misspecified).
  3. Whether the brands of interest to the stakeholder get a high preference score in any of the segments. For example, in the table below, there is lots of interest in Apple, Samsung, Sony, and Google, but if you were doing the study for another of the brands, you would probably want to increase the number of classes to find a segment that will resonate with the client. Provided that the BIC keeps decreasing, there is nothing dodgy about this.
  4. The complexity of the solution for stakeholders. The fewer classes, the more intelligible.
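As a small illustration of the first of these trade-offs, the snippet below compares invented BIC values for models with different numbers of classes and picks the lowest; in practice you would weigh this against the other three considerations:

# Hypothetical BICs for latent class models with 2 to 10 classes
bics = c("2" = 10450, "3" = 10310, "4" = 10240, "5" = 10205,
         "6" = 10190, "7" = 10182, "8" = 10179, "9" = 10181, "10" = 10178)
# All else being equal, the lowest BIC is preferred (here, the 10-class model)
names(which.min(bics))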


 

The donut chart below displays the preference shares for the 10-class solution (i.e., its Total column).


Profiling latent classes

Once we have created our latent classes, we allocate each person to a class and then profile the classes by creating tables. The table below, for example, shows our 5-class solution by product ownership. If you compare this table with the latent class solution itself, you will see that the product ownership lines up with the preferences exhibited in the MaxDiff questions.

Respondent-level preference shares

Sometimes it is nice to have preference shares for each respondent in the survey. Typically, they are used as inputs into further analyses (e.g., segmentation studies using multiple data sources). Once you have estimated a latent class model these are easy to compute (they are a standard output). However, they are not super-accurate. As we discussed above, there is insufficient information to compute a person's actual preference ordering, so inevitably any calculation of their preference shares relies heavily on data borrowed from other respondents, which in turn is influenced by how good the latent class model is at explaining the data. The table below shows the respondent-level preference shares from the 5-class model.
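A minimal sketch of one standard way such respondent-level shares are computed: weight each class's preference shares by the respondent's posterior probability of belonging to that class (all numbers below are invented):

# Hypothetical class-level preference shares (%) for two classes
class.shares = matrix(c(60, 25, 15,
                        20, 30, 50), nrow = 2, byrow = TRUE,
                      dimnames = list(c("Class 1", "Class 2"),
                                      c("Apple", "Google", "Samsung")))
# Hypothetical posterior class probabilities for one respondent
posterior = c(0.8, 0.2)
# Respondent-level shares: posterior-weighted average of the class shares
colSums(class.shares * posterior)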


The table below shows the average of the probability percentages computed for each respondent. They are very similar to the results in the Total column of the latent class model, but not quite the same (again, if you are super-technical: this is due to the non-linearity in the computations; a big difference between the two would be a clue that the model is poor). The Total column is more accurate than the Mean Probability % column shown in this table.


I have plotted the histograms of the preference distributions for each of the brands below. These distributions are based on our 5-class model, so they cannot show any more variation in preferences than was revealed in the earlier analysis. If we used more classes, we would get more variation. However, there are better ways to achieve this outcome.


The table below shows the preference share distributions from an even more complex model, known as a boosted varying coefficients model. (You won't find this in the academic literature; we invented it, but the code is open-source if you want to dig in.)  This shows better distributions for each of the brands (wider = better). A more technical blog post that discusses these more complex models can be found here.


The table below shows the preference shares for each respondent from this model. Take a look at respondents 1 and 13, whom we examined at the beginning of the post. The first respondent's clear preference for Microsoft and Google, and dislike of IBM, Nokia, and Yahoo, show through, even though some of the ordering has shifted slightly. Respondent 13's contradictory selections have been resolved in favor of Apple, which they selected twice as their most preferred.


From these respondent-level shares, the Mean Probability % works out as shown in the table below, which again matches the latent class analysis output quite closely.


Preference simulation

Sometimes in marketing applications of MaxDiff, people choose between alternative products. When doing such studies, it can be interesting to understand the preference shares after having removed some of the alternatives. This is super-simple. All we have to do is to delete the columns of the alternatives that we wish to exclude, and then re-base the numbers so that they add up to 100%. Below, I have recomputed the preference shares with Samsung and Apple removed.
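A minimal sketch of this calculation on a hypothetical table of respondent-level shares, with one row per respondent and one column per alternative:

# Hypothetical respondent-level preference shares (%); each row sums to 100
shares = matrix(c(40, 30, 20, 10,
                  10, 50, 25, 15), nrow = 2, byrow = TRUE,
                dimnames = list(NULL, c("Apple", "Samsung", "Google", "Microsoft")))
# Remove Apple and Samsung, then re-base each row so it again sums to 100%
remaining = shares[, !colnames(shares) %in% c("Apple", "Samsung")]
prop.table(remaining, 1) * 100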


Summary

Simple analysis methods are invalid for MaxDiff; they lead to grossly misleading conclusions. More advanced techniques, such as latent class analysis, give significantly more meaningful results.

If you click here, you can log in to Displayr and see all the analyses that were used in this post. Click here for a post on how to do this yourself in Displayr, and here for one on how to do it in Q.

How to Create a MaxDiff Experimental Design in Displayr https://www.displayr.com/how-to-create-a-maxdiff-experimental-design-in-displayr/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-to-create-a-maxdiff-experimental-design-in-displayr/#respond Wed, 17 May 2017 09:46:57 +0000 https://www.displayr.com/?p=1672 ...]]> Creating the experimental design for a MaxDiff experiment is easy in Displayr. This post describes how you can create and check the design yourself. If you are not sure what this is, best to read An introduction to MaxDiff first.

Click here for an interactive tutorial on MaxDiff design

Creating the design

  1. In Displayr, select Insert > More > Marketing > MaxDiff > Experimental Design.
  2. Specify the number of Alternatives in the study. In my technology study, for example, I had 10 brands, so I entered the number of alternatives as 10. The alternatives can be labeled if you wish, or shown as numbers.
  3. Specify the number of Alternatives per question. I tend to set this at 5. In studies where the alternatives are wordy, I prefer to use only 4 alternatives per question; where the alternatives are really easy to understand, I have used 6. The key trade-off here is cognitive difficulty for the respondent: the harder the questions, the less carefully people are likely to consider them.
  4. Specify the number of Questions to ask. A rule of thumb provided by the good folks at Sawtooth Software states the ideal number of questions: 3 * Alternatives / Alternatives per question. This suggests that in the technology study I should have used 3 * 10 / 5 = 6 questions, which is indeed the number I used (a one-line version of this calculation is sketched after this list). There are two conflicting factors to trade off when setting the number of questions. The more questions, the more respondent fatigue, and the worse your data becomes. The fewer questions, the less data, and the harder it is to work out the relative appeal of alternatives that have a similar level of overall appeal. I return to this topic in the discussion of checking designs, below.
  5. Specify the number of Versions. Where the focus is only on comparing the alternatives (e.g., identifying the best from a series of product concepts), it is a good idea to create multiple versions of the design so as to reduce order and context effects. Sawtooth Software suggests that 10 versions are sufficient to minimize order and context effects, although there is no good reason not to have a separate design for each respondent. Where the goal of the study is to compare different people, such as when performing segmentation studies, it is often appropriate to use a single version (if you have multiple designs, this becomes a source of variation between respondents and may influence the segmentation).
  6. Displayr's algorithm includes a randomization component. Occasionally, this can lead to poor designs being found (how to check for this is described below). Sometimes this problem can be remedied by increasing the number of Repeats.
  7. The alternatives for each question can be shown in numeric order (or the order of the labels) or in a random order.
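The rule of thumb from step 4 is a one-line calculation; for the technology study it works out as follows:

# Sawtooth Software's rule of thumb for the number of MaxDiff questions
alternatives = 10
alternatives.per.question = 5
3 * alternatives / alternatives.per.question   # 6 questions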

Download our free MaxDiff ebook

Interpreting the design

The experimental design is typically shown as a table. In the example below, each row represents a question. Each column shows which alternatives appear as options in each of the questions. Thus, in the first question, the respondent evaluates Alternatives 1, 3, 5, 6, and 10. More complicated designs can have additional information (this is discussed below).

More complicated designs

Click here for an interactive tutorial on MaxDiff design

I tend to add one additional complication to my MaxDiff studies: I have the data collection software randomize the order of the alternatives between respondents. One and only one respondent had brands shown in this order: Apple, Google, Samsung, Sony, Microsoft, Intel, Dell, Nokia, IBM, and Yahoo. So, whenever Apple appeared it was at the top; whenever Google appeared, it was below Apple if Apple appeared, but at the top otherwise; and so on. The next respondent had the brands in a different order, and so on.

If randomizing like this, I strongly advise having the randomization done in the data collection software. You can then undo it when creating the data file, enabling you to conduct the analysis as if no randomization ever occurred.

There are many other ways of complicating designs, such as to deal with large numbers of alternatives, and to prevent certain pairs of alternatives appearing together. Click here for more information about this.

Checking the design

In an ideal world, a MaxDiff experimental design has the following characteristics, where each alternative appears:

  1. At least 3 times.
  2. The same number of times.
  3. With each other alternative the same number of times (e.g., each alternative appears with each other alternative twice).

Due to a combination of maths and a desire to avoid respondent fatigue, few MaxDiff experimental designs satisfy these three requirements (the last one is particularly tough).
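If you want to check these properties yourself outside Displayr, the sketch below shows how appearance counts, pairwise co-occurrences, and binary correlations could be computed from a 0/1 design matrix with one row per question and one column per alternative (the tiny design here is invented for illustration):

# Hypothetical binary design: 6 questions x 5 alternatives (1 = shown)
binary.design = matrix(c(1, 1, 0, 1, 0,
                         0, 1, 1, 0, 1,
                         1, 0, 1, 1, 0,
                         0, 1, 0, 1, 1,
                         1, 0, 1, 0, 1,
                         1, 1, 0, 0, 1), ncol = 5, byrow = TRUE)
# How many times each alternative appears (ideally at least 3, and all equal)
colSums(binary.design)
# How many times each pair of alternatives appears together (ideally all equal)
crossprod(binary.design)
# Binary correlations between design columns (ideally as close to 0 as possible)
round(cor(binary.design), 2)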

Above, I described a design with 10 alternatives, 5 alternatives per question, and 6 questions. If you select the Detailed outputs option, you are shown a series of outputs that allows you to verify that the design shown earlier meets the first two of these requirements, and does an OK job on the last one (each alternative appears either once or twice with each other alternative).

The screenshot below shows Displayr, where I have reduced the number of alternatives per question from 5 to 4. This small change has made a good design awful. How can we see it is awful? The first thing to note is that 6 warnings are shown at the top of the screen (you need to use the scrollbar on the warnings to see all 6).

The first warning is telling us that we have ignored the advice about how to compute the number of questions, and we should instead have at least 8 questions. (Or, more alternatives per question.)

The second warning is telling us that we have an alternative that appears only twice, whereas good practice is for each alternative to appear at least three times.

The third warning tells us that some alternatives appear more regularly than others. Looking at the frequencies output, we can see that options appeared either 2 or 3 times. Why does this matter? It means we have collected more information about some of the alternatives than others, so we may end up with different levels of precision in our estimates of the appeal of the different alternatives.

Click here for an interactive tutorial on MaxDiff design

The fourth warning is a bit cryptic. To understand it we need to look at the binary correlations, which are shown below. This correlation matrix shows the correlations between each of the columns of the experimental design (i.e., binary.design shown above). Looking at row 4 and column 8 we see a big problem. Alternatives 4 and 8 are perfectly negatively correlated. That is, whenever alternative 4 appears in the design, alternative 8 does not appear, and whenever 8 appears, 4 does not. One of the cool things about MaxDiff is that it can sometimes still work even with such a flaw in the experimental design; however, it would be a foolhardy person that would rely on this, because the basic purpose of MaxDiff is to work out relative preferences between alternatives, and its ability to do this is clearly compromised if some alternatives are never shown with others.


The 5th warning tells us that there is a large range in the correlations. In most experimental designs, the ideal design results in a correlation of 0 between all the variables. MaxDiff designs differ from this, as, on average, there will always be a negative correlation between the variables. However, the basic idea is the same: we strive for designs where the correlations are as close to 0 as possible. Correlations in the range of -0.5 to 0.5 should, in my opinion, cause no concern.

The last warning tells us that some alternatives never appear together.  We already deduced this from the binary correlations.

Download our free MaxDiff ebook

Checking designs with multiple versions

When you set the number of versions to more than 1, this will not change any of the warnings described in the previous section. All of these warnings relate to the quality of the design for an individual person. Increasing the number of versions improves the design for estimating results for the total sample, but this does not mean the designs change in any way for individual respondents. Thus, if you are doing any analyses at the respondent level, changing the number of versions does not help in any way.

Additional detailed outputs are provided when using multiple versions, which show the properties of the design as a whole. These show the binary correlations across the entire design, and the pairwise frequencies. Their interpretation is as described above, except that it relates to the entire design rather than to the design of each version.

How to fix a poor design

The first thing to do when you have a poor design is to increase the setting for Repeats. Start by setting it to 10. Then, if you have patience, try 100, and then bigger numbers. This only occasionally works. But, when it does work it is a good outcome.

If Repeats does not work, you need to change something else. Reducing the number of alternatives and/or increasing the number of questions are usually the best places to start.

To find out more about MaxDiff and other important market research techniques, check out "Market Research"!

Click here for an interactive tutorial on MaxDiff design
