Statistical analysis for conjoint

So, you've collected your conjoint data, what next?

In this webinar you will learn

Here’s a little summary of some of the subjects we cover in this webinar

The next step is to do statistical analyses to compute the utilities.

This webinar will take you through the best practices in statistical analysis for conjoint, from how to choose the right model to how to check the results.

This is the third webinar in our series on conjoint, so please check out our earlier introduction and experimental design webinars.

The next webinar in the series is about how to extract insight and report your conjoint data.


This webinar will show you how to perform the statistical analysis side of conjoint data. That is, how to calculate the utilities. We will start by reviewing the goal of statistical analysis. We will then move onto getting data into Displayr. Then, we will work through the six stages of creating a good model.

My focus is on the general principles, but I will be showing things in Displayr. Q works almost identically.



The statistical analysis of conjoint is all about utilities. 35 years ago, when choice-based conjoint was invented, the goal was to estimate one utility for each attribute level. As we have talked about in previous weeks, we typically set the first level of an attribute to 0 and estimate everything relative to it.

Here, we can see that relative to Godiva, Dove's more appealing, Hershey even more appealing, and Lindt a bit less appealing.

Looking at cocoa strength, people like milk chocolate the most. It's 0 because we set the first level to 0. Dark's less appealing and 70% cacao is even less appealing. Finally, white chocolate is the least appealing, with a utility of -1.3.

We can't compare the cocoa utility numbers to the brand utility numbers. We compare only levels of the same attribute. That Hershey is higher than Milk, for example, tells us nothing. From this information alone it's possible to predict market share and draw lots of conclusions.
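For anyone who wants to see how utilities turn into shares, here is a small Python sketch. It assumes the standard logit rule, where each product's share is the exponential of its total utility divided by the sum of the exponentials across all products. That rule is my illustration rather than something on the slide, and the product names and utility totals are made up.

```python
import math

def logit_shares(total_utilities):
    """Convert each product's total utility into a predicted choice share."""
    exp_utilities = {name: math.exp(u) for name, u in total_utilities.items()}
    denominator = sum(exp_utilities.values())
    return {name: value / denominator for name, value in exp_utilities.items()}

# Hypothetical products. Each total is the sum of that product's
# attribute-level utilities (brand + cocoa strength + price, and so on).
products = {"Product A": 0.0, "Product B": -1.3, "Product C": 0.5}
shares = logit_shares(products)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Note that the totals add utilities across attributes for one product. That is fine, even though comparing a single brand utility to a single cocoa utility is not.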


These utilities…

About 20 years ago choice-based conjoint experienced a major jump forward. It became the norm to compute a utility for every respondent for every attribute level.

Looking here, the utility for white of -1.3 is an average. We can see that the first person's utility is 1.4 for white, meaning she prefers white chocolate to milk. The second person's utility is -3. That person strongly prefers Milk to white chocolate. And so on. By getting utilities for each respondent we can conduct much more fine-grained analyses. But, if you pause for a moment and think about it, you will realize that there must be a lot of noise in utilities that are estimated for each person. If we ask them 6 questions, as in this experiment, where they just choose from a set of alternatives, how can we so precisely compute their utility? We can't.


Each respondent's utility is itself…

The way that the modern techniques for calculating utilities work is that they identify, say, 100 or 1000 or some other number of possible values for each respondent for each attribute level. These possible utilities are called draws. So, while we have concluded that respondent 7 has an average utility of -3.4, it's possible their utility is 0.2, -2.9, -4.2, or any of the 100 numbers. The average of these 100 numbers is our best single guess. But the smart thing is to recognize we have uncertainty and to use this when drawing final conclusions. The goal with modern choice-based conjoint is to estimate these draws. That is, for each attribute level we want to estimate the full range of possible utilities that each person can have. These numbers are then used to derive all the key outputs that we will discuss next week.
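To make the idea of draws concrete, here is a minimal Python sketch. The ten draws are invented for illustration (a real model would save 100 or more per respondent per attribute level), and the summary function is my own, not a Displayr feature.

```python
import statistics

# Hypothetical draws for one respondent and one attribute level.
draws = [0.2, -2.9, -4.2, -3.8, -3.1, -4.0, -3.6, -2.5, -4.4, -3.7]

def summarize_draws(values):
    """Summarize the best single guess and the spread of a set of draws."""
    ordered = sorted(values)
    return {
        "mean": statistics.mean(ordered),      # the best single guess
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        # Share of draws above 0, i.e. preferring this level to the baseline.
        "share_positive": sum(v > 0 for v in ordered) / len(ordered),
    }

summary = summarize_draws(draws)
print(summary)
```

The mean is what you would report as the respondent's utility, while the minimum, maximum and share of positive draws show how uncertain that single number is.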


Getting the data into Displayr

This is the easy bit. How do we get the data into Displayr?


Setup instructions

We need two things to calculate utilities. We need the experimental design and a data set. All the detail is here but, let's go and do it.

We will create a new document.

As we talked about in the previous webinar, the experimental design looks something like this. While you can always use an Excel file for your design, there are other options.

You can use a design created in Displayr or Q. You can use both the Sawtooth CHO and dual file formats. You can use the JMP format. And, we've just added in a feature for setting up Survey Gizmo conjoint studies as well. And, for more exotic formats, there is the option of setting up an Experiment variable set.

We will start by importing the experimental design.

And now I will bring in the data set.


In Displayr:

Add data set … > Chocolate Respondent data


These data files need to be created so that they can talk together. That is, the version or task variables need to match up. For example, if you look at the Task variable in the experimental design, it's showing the task numbers. We've got four 1s, as there are 4 alternatives for the first task. And, we can see that respondent 1, in the respondent data file, did task 469, followed by 470, etc.
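If you are preparing the files yourself, a quick sanity check along these lines can catch mismatches before you fit anything. This is a sketch in plain Python with made-up rows. The field name "task" and the data layout are assumptions for illustration, not the actual file format.

```python
def find_unmatched_tasks(design_rows, respondent_tasks):
    """Return, per respondent, any task numbers that are not in the design."""
    design_tasks = {row["task"] for row in design_rows}
    unmatched = {}
    for respondent_id, tasks in respondent_tasks.items():
        unknown = [task for task in tasks if task not in design_tasks]
        if unknown:
            unmatched[respondent_id] = unknown
    return unmatched

# Hypothetical design: one row per alternative, 4 alternatives per task.
design = [
    {"task": task, "alternative": alternative}
    for task in (469, 470)
    for alternative in (1, 2, 3, 4)
]

# Hypothetical respondent data: the tasks each respondent was shown.
respondents = {1: [469, 470], 2: [469, 999]}

problems = find_unmatched_tasks(design, respondents)
print(problems)
```

An empty result means every task a respondent saw can be looked up in the design.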


Hygiene test

Now we move onto calculating and then checking our utilities. Many things that we calculate, such as percentages and means, have simple exact formulas. But, more complicated analyses have a lot of trial and error hidden under the hood.


An analysis is said to have converged when we are confident that we have done enough trial and error and have arrived at a good result.


This is a test of hygiene in much the same way as using hand sanitizer. There's no guarantee you'll die if you don't use it. But, you're more likely to. Testing the convergence of HB models is an active area of academic research, and we know a lot more today than we did even five years ago.

Let's start by creating our first model.


In Displayr:

Insert > More > Choice Modeling > Hierarchical Bayes


We just need to hook it up to the design and the data we've collected.

If we are looking to do a quick job, we're done. It will take a minute or two to finish computing.

We're now at the level you would reach if you used a well-written black-box tool.

For the rest of the webinar we will be digging deep and looking at more advanced issues. If you just want to do the basics, you can log off now and wait for the next webinar on reporting, where we will look at how to interpret and report the results.

But, I'm going to reduce the number of iterations so you can see what happens when you've got a model that's not converged.


In Displayr:

Model > Iterations > 5


An iteration is one attempt at trial and error. Please be aware that an iteration is defined differently in different programs. E.g., one of our iterations is about 100 in Sawtooth.

We've got some warnings. And, just like with everything else I have shown you, Displayr will coach us.

As you can see, we've got lots of weird technical warnings. So, we've failed the hygiene test.

What do we do? We follow Displayr's advice. It's telling us to run the model with more iterations.

I've already re-run the models with all the recommended iterations. Let's have a look.


Hierarchical Bayes - 100 iterations (the default)

Here's the result with the default number of iterations. The warning tells us that we need more iterations.


Hierarchical Bayes - 1000 iterations

Here's the result with 1,000 iterations. No warnings. Our model is clean.


The good hygiene dividend

The output on the left is from the default settings. The output on the right is with all the hygiene issues resolved. The results are almost identical. If you compare the models using cross-validation predictive accuracy, which I will soon explain, we find that the accuracy goes from 47.6% to 48.9%, so there is a benefit. But, it's not huge.

Our experience is that this is usually the case. But, take note of the topic. Hygiene. You can get into trouble if you ignore it.


Smell test

Now, let's say you are using software that doesn't have hygiene tests. Well, as we've learned recently, not testing for coronavirus is not a good protection! Fortunately, unlike with COVID-19, you can do the smell test. And, you need to do it even if the hygiene tests are passed.

Let's go back to our first model, where we only had 5 iterations and it hadn't converged. We know it was bad as we didn't have enough iterations. Can we work that out with the smell test?

We are going to start by putting a diagnostic on.

Model > Questions left out for cross-validation: 1

In this study, each person answered 6 questions. What I've done now is that we only use 5 of the questions to fit the model, with a different 5 selected for each person.

The result shown here is then the predictive accuracy for the 6th task. This is sometimes called the holdout accuracy in market research, and cross-validation more generally. What it says is that the model predicts the choice correctly 24.1% of the time. As there are four alternatives, we've done a bit worse than chance! But, remember, this is the model that has not converged. It's only got the 5 iterations.
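Here is a rough Python sketch of the bookkeeping behind that number. It is not Displayr's implementation. The function names, the random seed and the example predictions are all hypothetical, and the step that actually fits the model on the remaining 5 questions is left out.

```python
import random

def choose_holdout_questions(respondent_ids, n_questions=6, n_left_out=1, seed=1):
    """Pick, for each respondent, which question numbers are left out of fitting."""
    rng = random.Random(seed)
    return {
        rid: sorted(rng.sample(range(1, n_questions + 1), n_left_out))
        for rid in respondent_ids
    }

def holdout_accuracy(predicted_choice, actual_choice):
    """Share of held-out questions where the predicted alternative was chosen."""
    hits = sum(predicted_choice[key] == actual_choice[key] for key in actual_choice)
    return hits / len(actual_choice)

# A different question can be held out for each respondent.
holdouts = choose_holdout_questions(["r1", "r2", "r3", "r4"])

# Hypothetical predictions and actual choices on the held-out question
# (alternatives are numbered 1 to 4).
predicted = {"r1": 2, "r2": 4, "r3": 1, "r4": 3}
actual = {"r1": 2, "r2": 1, "r3": 1, "r4": 4}

accuracy = holdout_accuracy(predicted, actual)  # 2 of the 4 are correct
print(holdouts, accuracy)
```

With four alternatives, an accuracy near 0.25 is what random guessing would give.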

An even better metric than the accuracy is the RLH, which stands for root likelihood. What makes it better is that it takes into account both whether the choice was predicted accurately, and, how confident the model was in its prediction. So, if the model is sure that a person will choose option 1, but they choose option 2, this is heavily penalized. Whereas if the model thinks each option has a probability of 25%, then the penalty is less.

Like with accuracy, if the model was as good as chance, it would have a result of 0.25. And, this model's no better than chance.
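As a sketch of the calculation, RLH can be computed as the geometric mean of the probabilities the model gave to the alternatives that were actually chosen. The Python below assumes that definition. The probabilities are invented to show the penalty for a confident wrong prediction.

```python
import math

def root_likelihood(chosen_probabilities):
    """Geometric mean of the predicted probabilities of the chosen alternatives."""
    log_total = sum(math.log(p) for p in chosen_probabilities)
    return math.exp(log_total / len(chosen_probabilities))

# A model that gives every alternative 25% on each of 6 questions.
chance_rlh = root_likelihood([0.25] * 6)

# A model that is moderately confident and right on every question.
steady_rlh = root_likelihood([0.7] * 6)

# A model that is very confident and right 5 times, but gave the chosen
# alternative only a 1% probability on the 6th question.
overconfident_rlh = root_likelihood([0.9, 0.9, 0.9, 0.9, 0.9, 0.01])

print(chance_rlh, steady_rlh, overconfident_rlh)
```

The single badly missed question pulls the third model below the second, even though it was more confident on the other five.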

OK, so this model smells. We also want to see if the results look plausible.

Let's look at price. We are looking at the mean utility here. This tells us that people don't want to pay $1.49 as much as 99 cents. That makes sense, but the average utility of $1.99 is the same as 99 cents, and higher than $1.49. That makes no sense at all. So, this model fails the smell test.


Reordering the utilities

You will remember me talking about how the first attribute level is set to 0, and the other levels are relative to it. I find it helpful to change the order, to make key results clearer.

Look at cocoa strength. I want to use Milk as the benchmark.

Now, Milk is the benchmark, and everything's relative to it.

Let's go back to our model that passed the hygiene test.

Look at the price results. What looked wrong before is now interesting. People prefer to pay $0.99 to $1.49, but they are indifferent between $1.49 and $1.99. So, only a fool would sell at $1.49. And, at $2.49, the appeal is lower, but there's a lot of variability.

We talked a bit about the RLH statistic before. A value of 0.25 is consistent with random choice. This result here is telling us that on average, people are choosing a lot better than random. But averages hide lots of problems.

Choice questions can be boring. There's a good chance of garbage data. We want to remove people who are answering randomly. We will write a tiny bit of code to work out the distribution of RLH.

One of the cool things about Displayr is that you can calculate anything. The general way of doing this is by creating an R Output.


In Displayr:

Insert > R Output


Our Hierarchical Bayes is called choice model 2. To see what data it has in it we use names.

Cool. The model has something called RLH. Let's look at it.

So, these are the RLH values of the respondents.

Let's plot that.

So we do have a few people below random. That is, below 0.25. Let's look at some percentiles.

Now, the question we have to ask is, if people were randomly responding, what would our percentiles look like? The answer to this varies study by study. Fortunately, we can simulate it easily. We will duplicate our whole page.

We will tell Displayr to simulate some data.

By default, it simulates data with a mean of 0, but you can add priors in if you want. This will take a while to compute, as it's doing all 1,000 iterations. We'd be fine with 100. To save time I've done this before the webinar. So, I'll delete this page to save on computing costs.

This shows us that with fully random data, 95% of respondents have an RLH of up to 0.35. So, this should be our cut-off. That is, any value of 0.35 or less may be poor quality data.
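The logic of picking a cut-off from the simulated data can be sketched as follows. This is illustrative Python, not the R code used in the webinar. The simulated RLH values and respondent values are invented, and the nearest-rank percentile is just one of several reasonable definitions.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def flag_possible_random_choosers(rlh_by_respondent, cutoff):
    """Respondents whose RLH is no better than what random data produces."""
    return [rid for rid, rlh in rlh_by_respondent.items() if rlh <= cutoff]

# Hypothetical RLH values from a model fitted to fully random choices.
simulated_rlh = [
    0.21, 0.22, 0.23, 0.24, 0.24, 0.25, 0.25, 0.26, 0.26, 0.27,
    0.27, 0.28, 0.28, 0.29, 0.30, 0.31, 0.32, 0.33, 0.35, 0.38,
]
cutoff = percentile(simulated_rlh, 95)

# Hypothetical RLH values from the real respondents.
real_rlh = {"r1": 0.62, "r2": 0.30, "r3": 0.35, "r4": 0.41}
flagged = flag_possible_random_choosers(real_rlh, cutoff)

print(cutoff, flagged)
```

Because the cut-off comes from a simulation of this particular design, it needs to be recomputed for each study.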

Now, to filter the model by RLH, we've got three steps. First, we need to save the RLH as a variable into the data file. Then, we need to duplicate our model.

Then we just apply a filter to the model.


In Displayr:

New > Filter

Data: Search for RLH

RLH from …

Change to 0.36 to 0.83


This will filter out 8% of the sample.

If I click Create filter it will run. But I've done this earlier.

As you can see, by getting rid of the dodgy respondents, we've actually improved the RLH for the whole model, and our lowest is now 0.41.


Remove random choosers

So, we just looked at how to remove random choosers.


Remove irrational choosers

There's another class of people we might want to get rid of. That's people who are making choices that don't make sense at all. It's not always a smart thing to do. But it can be. It's a thin edge of the wedge type of problem and needs to be thought through on a case by case basis.


SIMULATION > iterations saved per individual: 100

As an example of irrationality, let's look at price. There are a small number of people that prefer $2.49 to $0.99. We can see this by the small blue bars.

Now, as researchers, we can all explain any weird result. But, remember, this is price after taking into account brand and other attributes, so it really makes little sense.

We could delete the respondent with a utility of more than 0 for a price of $2.49, but this would be the wrong thing to do. Why wrong?

As discussed before, we have uncertainty in the data. There will be some people who don't care about price, so they may have a marginally positive price just due to noise. If we delete these people, we would accidentally be forcing the model to make price more important than it is.

So, what we need to do is to first get the draws for each respondent.

To do this we first need to tell Displayr to save the draws. By default it doesn't do it.

This is done here:

SIMULATION > iterations saved per individual

I've set this to 100.


Checking price irrationality

This bit's a bit more heavy duty. Very few people do this. But, it's the righteous thing to do.

I've created a table here which shows, for the 374 respondents with valid data, the distribution of their beta draws for a price of $2.49. What you can see is that the median for most of them is negative. But, pretty much all of them have some positive values. Remember, as discussed before, we do have a lot of uncertainty with these models when we do calculations at the respondent level. If we sort by the median, we will get a better idea of what's going on.

So, this first page shows people with positive medians. That is, who made choices consistent with them preferring higher prices.

We've got about 40 people with positive medians. So, the naïve thing would be to delete these 40 people. But, a safer approach would be to delete people who have predominantly got positive price draws. To appreciate this, let's sort by the 5th percentile.

OK, so looking at the 5th percentiles, we can see that the person with the highest value has a negative value. We're actually doing what's called a Bayesian significance test. And, what we can see here is that there is nobody who, at the 5% level of significance, we can conclude has irrational price data.

So, we've passed this test. If we hadn't, we'd filter just like we did before.
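The difference between the naïve rule and the percentile rule can be sketched in a few lines of Python. Everything here is hypothetical: the draws are invented, only 10 are shown per respondent rather than 100, and the nearest-rank percentile is my choice of definition.

```python
import math
import statistics

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical draws of the utility for the highest price, per respondent.
price_draws = {
    "r1": [-2.1, -1.8, -2.5, -1.2, -0.9, -1.7, 0.3, -2.2, -1.5, -1.9],
    "r2": [0.4, 0.6, -0.2, 0.8, 0.5, 0.3, 0.7, -0.1, 0.9, 0.2],
    "r3": [0.5, 0.9, 0.7, 1.1, 0.6, 0.8, 1.3, 0.4, 1.0, 0.2],
}

# Naive rule: flag anyone whose median draw is positive.
naive_flags = [rid for rid, d in price_draws.items() if statistics.median(d) > 0]

# Safer rule: flag only people whose 5th percentile is positive, that is,
# people for whom at least 95% of the draws prefer the higher price.
safer_flags = [rid for rid, d in price_draws.items() if percentile(d, 5) > 0]

print(naive_flags, safer_flags)
```

In this made-up example the second respondent has a positive median but also some negative draws, so only the safer rule keeps them.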


Check external validity

External validity is a term of art in research, which deals with whether the results will hold up in the real world.


Different levels of external validity

The simplest check is whether we can predict above random. We have passed this, but it's a pretty weak test.

The next level up is whether the utilities we've estimated correlate with other data. In particular, we would expect that people with higher incomes should be less concerned about price. That is, our utility for $2.49 should be positively correlated with income. And, we would expect that diabetics will have a greater utility for sugar free chocolate.

To work this out, I will start by saving the utilities. Now, utilities are also known as coefficients. And, there are lots of different ways of calculating them. I'll talk about them next week. For the moment, I will click on Save individual-level coefficients.

Income's currently stored as categories. We need to recode this using midpoints.
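As a sketch of what the recoding and the check look like, here is some illustrative Python. The income bands, their midpoints (in thousands of dollars) and the utilities are all invented, and the correlation is a plain Pearson correlation written out by hand.

```python
import math

# Hypothetical income bands mapped to midpoints, in thousands of dollars.
INCOME_MIDPOINTS = {
    "Under $25k": 12.5,
    "$25k to $50k": 37.5,
    "$50k to $100k": 75.0,
}

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# Hypothetical respondents: income band and utility for the highest price.
income_band = ["Under $25k", "$25k to $50k", "$50k to $100k", "$25k to $50k", "Under $25k"]
price_utility = [-2.1, -1.4, -0.6, -1.1, -1.9]

income_numeric = [INCOME_MIDPOINTS[band] for band in income_band]
correlation = pearson(income_numeric, price_utility)
print(round(correlation, 2))
```

A positive correlation is what we would hope to see: higher income goes with a less negative utility for the high price. An open-ended top band has no natural midpoint, so you would need to pick a value for it.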

OK, so our model's promising. What would be ideal? Checking it against real world behavior. For example, can it predict changes in market share accurately over time? Sales of new products? Often the easiest way to check is to compare sales by geographic market or demographic group.


Choose the best model

And now for the last stage. Choosing the best model.


Key decisions…

One decision is how to treat price. We can treat it as separate price levels. Or, we can treat it as a numeric attribute, which means assuming that there's a straight line that passes through the price points. There are pros and cons. I'll return to it.
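To see what the straight-line assumption does, here is a small Python sketch. The utilities for the four price points are invented to mimic a flat spot in the middle. Fitting a least-squares line to them is only a way to picture the assumption: a real model with numeric price estimates the slope directly from the choices.

```python
def fit_line(x, y):
    """Ordinary least-squares straight line through the points (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

prices = [0.99, 1.49, 1.99, 2.49]

# Hypothetical utilities when each price is a separate level: the first
# level is 0, and the two middle prices are equally appealing.
separate_levels = [0.0, -0.6, -0.6, -1.5]

slope, intercept = fit_line(prices, separate_levels)
linear_utilities = [intercept + slope * p for p in prices]

# How far each separate-level utility sits from the straight line.
gaps = [level - line for level, line in zip(separate_levels, linear_utilities)]

print(round(slope, 2), [round(g, 2) for g in gaps])
```

The line uses one number (the slope) instead of three, which is simpler, but it cannot represent the flat spot between the middle prices.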

Should we include covariates? What prior covariance distribution should we use?


Best model so far

So here's our best model so far. I've done the cross validation again, so our predictive accuracy is 48.4% and the RLH is .400. Remember these numbers. 48.4 and .400.


Numeric price attribute

Here I have used a numeric price attribute. I clicked this option and entered in the values I wanted to use.

We've got both higher cross-validation accuracy and RLH when using price as numeric.

This is a good thing and a bad thing. The good thing is that we've found a better model. The bad thing is that the old model had this nice nonlinearity in it. So, we will end up with some inconsistencies at reporting time. More about that next week. Note our benchmark accuracy is now 50.5%, and the RLH is now .409.



Covariates are additional variables that are used to improve prediction. Theoretically you can use anything. But, using demographics rarely helps. You want data that has a logical relationship.

In our case, the data that makes sense to include is the data on income and diabetes.

I just dragged them to here. But, as you can see, the predictive accuracy is worse and the RLH is better. So, we have a conflicting result. Which should we use? Simple is better so I wouldn't use the covariates here.


Hierarchical Bayes (HB) versus Latent Class Analysis

We never have enough data to accurately estimate the utilities for a single person. We fix this problem by pooling data (i.e., analyzing all people's data at the same time). We have to make a choice about how we believe people differ (our covariance prior). The four most common assumptions are:

Everybody is the same (Multinomial Logit)

  • MODEL > Type: Multinomial Logit

There are a small number of groups of people (LCA)

  • MODEL > Type: Latent Class Analysis, Number of classes: 2 or more

The population is multivariate normal (the simplest widely used HB model)

  • MODEL > Type: Hierarchical Bayes, Number of classes: 1

There are multiple overlapping groups of people, each containing a multivariate normal distribution (Mixture of normals)

  • MODEL > Type: Hierarchical Bayes, Number of classes: 2 or more

Recommended approach

  • Start with the standard HB model (multivariate normal)
  • See if you can improve on it with LCA
  • If you can improve with LCA, try mixture of normals (mixture of normals is theoretically best, but can take days...)


Latent class analysis

Here's one I did before. I've set this to 2 classes. The RLH and accuracy are much worse, and the model is still well below the HB. So, our best model is the one with the covariates.

