Advanced Analysis - Displayr
https://www.displayr.com/category/data-science/

Fast track categorizing and coding text data
https://www.displayr.com/fast_track_text_data/

Overview

Displayr's text coding functionality is designed with the needs of the survey researcher front and center. For many years, the text categorization functions in Displayr have supported what we might call a manual workflow: we make it easy to view, sort, and filter text responses, create and structure categories, and assign (code) responses to those categories. More recently, we've added semi-automated functions to the interface and extensively upgraded the algorithms that drive them. We believe our tools in this space are state of the art:

  • When you select "Semi-Automatic" Text Categorization, you are immediately presented with a draft set of categories, with the bulk of the data already coded - so in a matter of minutes you are off to a great start
  • The algorithms that create this output are based on analyzing context and meaning (not word similarity, like many other tools), so your draft code frames are intuitive from the get-go
  • We've made this work effectively for multiple-response categorizations (where responses can be assigned to more than one code), which are historically more challenging to automate
  • For tracking and related research, we have specific algorithms that recognize and categorize unaided brand awareness questions
  • Once you have your draft categories, the user interface makes it easy to edit, with tools to combine, rename, and split categories

So the workflow now becomes:

  1. Let Displayr do the hard work and get you most of the way there (via a draft categorization) in a fraction of the time it would take manually
  2. You then fine-tune and edit the categories via the intuitive user interface.

Accessing the automated functions

The quickest way to do this is to select a text variable in the Data Set tree, hover above or below it to '+' insert a new variable, and follow the prompts via the Semi-Automatic menu path:

 

 

We know some users might want to start the process manually. This could involve reading through some responses and creating some pre-planned categories. Even if you follow the Manual menu path, you can still access the Automatic categorization function, and at any time you can speed up the coding of the remaining uncategorized data. In the categorization interface, set "Sort by:" to Fuzzy match (as matching is a key building block of the algorithm), and the "Auto" button appears:

 

 

The functionality and workflow in action

Take a look at the process in action in this short video. It uses an open-ended question on how people feel about 'Tom Cruise' as input.*

You can get a broader overview of text analysis methods and solutions in this webinar recording: How to quickly analyze text data.

Streamline your text data analysis.

The process of turning open text responses into usable data is traditionally time-consuming and expensive (it is often outsourced). Displayr's text categorization tools are state of the art: you can create a draft categorization in minutes automatically and then quickly fine-tune it into a polished code frame. If you use a lot of text data and want to know more, book a demo or take a free trial.

 

*Discretion is advised - the data used in the video is from a real survey containing unvarnished attitudes to Tom Cruise. Some respondents have written unkind, distasteful, and potentially offensive things. Displayr does not condone or endorse any of the comments that have been made.

Save time translating and coding text data
https://www.displayr.com/translating-text-data/

While there are several ways to translate text data in Displayr, our text categorization function is made even more powerful by having Automatic Text Translation built into the interface. When you insert a new text categorization variable (+ > Text Categorization > .... > New - more on this in the video below), you are given the option to Translate the text:

 

 

Use any Source and Output language.

You are then prompted to select the Source language:

  • Automatically detect language
  • Specify with variable (use this option if the source language is identified by a variable in your data set. This option is particularly useful if your file contains multiple languages)
  • A specific language - the default language is English.

You can set the Output language here as well.

 

Text Translation in Action

Consider a simple scenario where data has been collected on hotel reviews and it includes a "comment" option - guests of course need to complete the survey in a language they are comfortable with:

  • The data file will have multiple languages, and (typically) an additional variable classifying the language selected.
  • The person responsible for categorizing (coding) the data will want to do so in their language
  • The outputs, being the categories (or code-frame), will also need to be in their language.

Displayr makes all this very easy, including dealing with multiple language inputs simultaneously. And once translated, you can create an initial categorization (code frame) automatically.

Take a look at the process in action in this short video, covering both single and multiple language translation ...

 

Streamline your text data translation and analysis.

The process of translating text data into the analyst's language is traditionally time-consuming and expensive (it is typically outsourced). Displayr's translation tools are now available directly in the text categorization interface - you can even create a draft code frame in your preferred language automatically. If you use non-native language text data and want to know more, book a demo or take a free trial.

Learn More about Text Analysis in Displayr
https://www.displayr.com/learn-more-about-text-analysis/

Text Analysis in Displayr - General Resources

These are the best places to start to learn about text analysis in Displayr.

General Categorization (Coding) Resources

Automatic Categorization

Manual & Semi-automatic Categorization

Sentiment analysis

Word Clouds

Other Uses of Text Data

Manipulation of Text Variables

Working with Principal Components Analysis Results
https://www.displayr.com/working-with-principal-components-analysis-results/

Principal Components Analysis (PCA) is a technique for taking many variables and creating a new, smaller set of variables. These aim to capture as much of the variation in the data as possible. In this post, we show you how to save, access, and export the PCA results and output. For information on how to set up and run the PCA, see How to Do Principal Components Analysis in Displayr.

Principal Component Loadings

The default PCA output is the Principal Components Loadings table which shows one row for each of the original variables. From the same example used in How to Do Principal Components Analysis in Displayr, each of the 8 new variables or components identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the original variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. To display these values, deselect the Suppress small coefficients checkbox.

PCA Component Loadings Table
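For readers who like to see the mechanics, here is a minimal base-R sketch (using the built-in mtcars data rather than the survey example above) showing that these loadings are simply correlations between the original variables and the component scores. Displayr computes this for you, so the code is purely illustrative.

[sourcecode language="r"]
# Illustration only: loadings are correlations between the original variables
# and the component scores. Uses the built-in mtcars data, not the survey data.
dat <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
pca <- prcomp(dat)
scores <- pca$x[, 1:2]        # scores on the first two components
round(cor(dat, scores), 2)    # each cell is a variable-by-component correlation
[/sourcecode]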

Saving Component Scores

To save a set of respondent level component score variables from the PCA output, select:

Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions

This creates a set of variables for each component at the top of the Data Sets tree, grouped together as a question called Scores from dim.reduce. These scores are standardized respondent-level component scores with a mean of 0 and standard deviation of 1 across the entire sample. You can then rename the component variables based on the attributes to which they most closely correlate. To do this, select each of the component variables grouped under Scores from dim.reduce in the Data Sets tree, right-click, and select Rename.

The new variables are linked back to your PCA output. This means that if you change any of the input options and then calculate the PCA again, the scores will also update automatically based on the updated analysis. If you change the number of components in the analysis, you should delete the variables for the scores in the Data Sets tree and save a new set of scores.

As an alternative, you can also save the component score variables as follows:

1. From the Insert menu, select R > Numeric Variable
2. In the R CODE field, paste in the code here (where dim.reduce is the name of the output that you've previously created):

[sourcecode language="r"]
fitted(dim.reduce)
[/sourcecode]

3. Click the Calculate button to run the code.
4. Allocate a Question Name and Label in GENERAL.

Exporting PCA Results

To export the Rotated Loadings table, select the PCA output and then from the menu select Export > Excel. Select Current Selection and then click the Export button. An Excel file containing the loadings table will be exported.

You can also generate an R output of the loadings table by selecting Insert > R Output (in the Analysis group) from the menus, then enter the following R code and click the Calculate button.

[sourcecode language="r"]
dim.reduce$rotated.loadings
[/sourcecode]

This will generate an unsorted R table containing the loading coefficients, which can also be exported to Excel. You can adjust the number of decimal places using the decimal options on the Appearance menu. Note that this is based on a PCA named dim.reduce, which is the default PCA object name in Displayr. If you've renamed your PCA analysis, you'll need to make the change in the code as well.
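If you want the coefficients sorted, a small variation of the snippet above can be used in the same R Output. It assumes your PCA output is still named dim.reduce; sorting by the absolute loading on the first component is just one possible choice.

[sourcecode language="r"]
# Assumes the PCA output is named dim.reduce, as in the snippet above.
loadings <- dim.reduce$rotated.loadings
# Sort rows by the size of their loading on the first component, then round
round(loadings[order(abs(loadings[, 1]), decreasing = TRUE), ], 2)
[/sourcecode]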

If you instead want to export the respondent-level component scores, you can do so by creating a raw data table and then exporting it to Excel. To do this, from the menu select Insert > More > Tables > Raw Data. Next, select each of the component scores from the Variables drop-down list in the Object Inspector. Click the Calculate button to generate the output. This output can now be exported by selecting an option from the Export menu.

Learn More About MaxDiff
https://www.displayr.com/learn-more-about-maxdiff/

This is a guide for everything you need to know about MaxDiff. It covers the “what is?” and the “how to…” of different approaches to the analysis, from preference share to profiling latent classes and finally how to interpret the analysis. There are worked examples, shown in Displayr and R.

 

Introduction

What is MaxDiff? Have you ever needed to know what your customers prefer? MaxDiff (otherwise known as Best-Worst Scaling) quite simply involves respondents indicating the ‘Best’ and the ‘Worst’ options out of a given set, so that we can gain an understanding of consumers’ preference choices. This can range from how they feel about specific brands or products to which features make a difference and are of value to the consumer.

A Beginners Guide to MaxDiff MaxDiff is a research technique for measuring relative preferences.

How MaxDiff Analysis Works This post explains the basic mechanics of how preferences can be measured using the data collected in a MaxDiff experiment.

 

DIY MaxDiff

Make MaxDiff a piece of cake Quickly go from experimental design to analysis, to interactive report in one tool. Displayr automates all the painful manual steps and makes it easy to deliver best-in-class results, even for the novice.

The 4 easy steps that’ll make any good researcher proficient at MaxDiff This webinar is for market researchers and consumer insights people who analyze data (from novice to expert).

11 Tips for your own MaxDiff Analysis  This post shares 11 tips to help researchers be self-sufficient in MaxDiff analysis.

DIY MaxDiff eBook This ebook will take you from generating experimental designs to conducting advanced Hierarchical Bayes analysis.

 

Interactive Tutorials

Creating an Experimental Design for MaxDiff This interactive tutorial demonstrates how to create a MaxDiff Experimental Design

Analyzing MaxDiff Data This interactive tutorial demonstrates the range of purpose-built tools for analyzing the data from MaxDiff experiments available in Displayr.

 

Experimental Design

How to Create a MaxDiff Experimental Design in Displayr This post describes how you can create and check the design yourself.

Advanced MaxDiff Experimental Designs A MaxDiff experimental design creates multiple lists of alternatives to be shown to respondents in multiple questions.

Pairwise Balanced MaxDiff Designs This post gives some background on how MaxDiff designs are created, including a new method for making multiple version designs that are pairwise balanced.

How To Check Experimental Design This post explains the basic process followed when doing a rough-and-ready check of the experimental design.

 

Data File Formats

Custom Data Files - Survey Gizmo MaxDiff This QScript makes it possible to analyze Alchemer (formerly Survey Gizmo) MaxDiff data in Displayr.

MaxDiff Data File Layouts There is no standard way of laying out the data from MaxDiff experiments. The following descriptions encapsulate some of the common ways.

 

Statistical Analysis

How to Analyze MaxDiff Data in Displayr This post discusses a number of options that are available in Displayr for analyzing data from MaxDiff experiments.

Setting Up a MaxDiff Experiment as a Ranking There are some more 'exotic' types of analysis of MaxDiff data where it is useful to have the data set up as a Ranking Structure.

Counting Analysis of MaxDiff Data This post explains how to do Counts analysis of MaxDiff data.

Comparing MaxDiff Models and Creating Ensembles in Displayr There are a variety of different models available in Displayr to perform MaxDiff analysis. This post describes how to easily compare the models. It also demonstrates how to create an ensemble that combines the models and potentially improves prediction accuracy.

The Accuracy of Hierarchical Bayes When the Data Contains Segments This post explores the implications of using Hierarchical Bayes with data that contains segments.

Using Hierarchical Bayes for MaxDiff in Displayr This post describes how to run Hierarchical Bayes for MaxDiff in Displayr, and explain the options and outputs available.

Checking Convergence When Using Hierarchical Bayes for MaxDiff This post discusses technical information about how to check for convergence in a Hierarchical Bayes MaxDiff model.

Comparing Tricked Logit and Rank-Ordered Logit with Ties for MaxDiff This post compares two ways in which MaxDiff data is treated in analyses such as latent class analysis and Hierarchical Bayes.

Using Cross-Validation to Measure MaxDiff Performance This post compares various approaches to analyzing MaxDiff data using a method known as cross-validation.

Comparing MaxDiff Results from Different Packages This post lists the main reasons why you may get different results with different packages.

MaxDiff Mixture Models This post discusses the main mixture models used to analyze MaxDiff experiments.

Anchored MaxDiff  Anchored MaxDiff experiments supplement standard MaxDiff questions with additional questions designed to work out the absolute importance of the attributes.

 

Case Studies

MaxDiff Analysis in Displayr, a Case Study This case study illustrates an advanced analysis of experimental data in Displayr.

Case Study: MaxDiff - Presidential Traits This case study formed the 4th and final part of the webinar DIY Market Research Dashboards - Building 4 in 40 minutes.

Commander-in-Chief MaxDiff A story-style dashboard (an alternative to PowerPoint) showing an analysis of what Americans desire in their Commander-in-Chief.

 

Learn more

If you can't find something you can always ask the technical support team, who love to help. Just email support@displayr.com. Happy learning!

Learn More about Dimension Reduction in Displayr
https://www.displayr.com/learn-more-about-dimension-reduction-in-displayr/

Correspondence Analysis


Webinar:
DIY Market Mapping Using Correspondence Analysis

Ebook: DIY Correspondence Analysis

How Correspondence Analysis Works (A Simple Explanation)

Understanding the Math of Correspondence Analysis

How to Interpret Correspondence Analysis Plots

Correspondence Analysis Versus Multiple Correspondence Analysis

Principal Component Analysis

Principal Component Analysis (Wiki example)

How to Do Principal Components Analysis in Displayr

The Basic Mechanics of Principal Components Analysis

Principal Component Analysis of Text Data

Varimax Rotation

Component Score Coefficient Matrix

Kaiser Rule

Determining the Number of Components in Principal Components Analysis

Validating Principal Components Analysis

Common Misinterpretations of Principal Components Analysis

Text Analysis - Advanced - Principal Components Analysis (Text)

Saved Principal Components Analysis Variables

 

Multidimensional Scaling and t-SNE

What is Multidimensional Scaling (MDS)?

t-SNE

How t-SNE Works

Goodness of Fit in MDS and t-SNE with Shepard Diagrams

 

How to Do Principal Components Analysis in Displayr
https://www.displayr.com/how-to-do-principal-component-analysis-in-displayr/

Data setup

Principal Components Analysis always views data numerically. This means that you need to be careful with the question Structure assigned to your variables to ensure the analysis views their numeric values. The variables in a PCA should be part of a Numeric, Numeric - Multi, or Binary - Multi question.

In most cases, you should set your variables up as Numeric or Numeric - Multi. The variables do not need to be grouped together. Remember, they could come from different questions, but they should all be on the same scale (that is, don’t mix 5-point scales with binary variables or 10-point scales). Binary - Multi is appropriate to use when the data are binary.

If your variables are not set up as Numeric, Numeric - Multi, or Binary - Multi, you can:

  1. Locate the variables in the Data Sets tree.
  2. (Optional) Make new copies of the variables by selecting them, and from the menu choosing Home > Duplicate.
  3. From the Object Inspector on the right side of the screen, change the Structure to either:
    1. Numeric, if there’s a single numeric variable,
    2. Numeric - Multi, if you have multiple numeric variables that are grouped together, or
    3. Binary - Multi, for binary variables.

In this article, I am using an example of a 5-point scale (called “Q23. Attitudes”). We asked several statements about our respondents' mobile phone use. Originally, the variables were set up as a Nominal - Multi question, which is typically how looped scales like this will appear in Displayr. In my screenshot below, I made a copy of the question for use in the PCA, and then set the Structure to Numeric - Multi.

Data Sets Tree

Creating the Principal Components Analysis

To create the PCA in Displayr:

Object Inspector

  1. Select Insert > Dimension Reduction > Principal Components Analysis.
  2. In the Object Inspector on the right side of the screen, choose the variables that you want to analyze in the Variables box.
  3. Tick Automatic, which ensures the PCA will remain up to date when the data changes or when you change the settings.

The output from the PCA is what is known as a loadings table. This table shows one row for each of my original mobile phone statement variables (there are 23). Each of the 8 new variables identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the old variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. Change this by toggling the Suppress small coefficients box.

PCA Component Loadings Table

The table is sorted in a way that makes it easy to work out what the 8 new variables mean. The first variable (“Component 1”) shows a strong correlation with the variables for “Want to view videos”, “Want video phone”, “Want to play music”, “Like fast internet on phone”, and “Do mobile banking”. We conducted this study before the age of the smartphone. At the time, these higher-technology features were uncommon in phones.

This new variable thus represents an underlying factor of desire for better technological capabilities in phones. The second variable strongly correlates with variables that reveal a desire to stay in touch and connected. The third variable represents an attitude that phones need only make calls or have basic functionality, and so on.

The output also tells us a number of key bits about the analysis:

  • The 8 components represent 57.7% of the original variance in the data. You inevitably lose some information when you reduce variables like this.
  • The first variable (“Component 1”) accounts for 12.8% of the variation. The second accounts for 8.63% of the variation, etc. The sort order goes from most variation to the least variation.
  • The footer contains additional sample size information and settings info.

In the next few sections, I’ll explain some settings that we didn’t change, and how to save the new variables to your data set so you can use them elsewhere.

Determining the number of components

In the analysis above, the PCA automatically generated 8 variables. It did this using a heuristic known as the Kaiser rule, an option in the Rule for selecting components drop-down menu. This is a commonly used rule, but you can also choose to use two other methods:

  • Number of components. Choose this option if you want to choose the number of components to keep.
  • Eigenvalues over. Eigenvalues are numbers associated with each component, and these are listed at the top of each column. This setting lets you specify the cut-off value for components.

Rotations

In the analysis above, I used a technique called Varimax rotation, Displayr’s default option in the Rotation method drop-down menu. The concept of rotation can be a bit abstract to talk about without getting into the mathematics of the technique. Putting it simply, the PCA problem can have an infinite number of solutions which all capture the same amount of variation in the data. The rotation tries to find the solution that is easiest to interpret, by rewriting it so that as many loadings as possible are close to zero (or to a value of 1).
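If you are curious about what the rotation does mechanically, here is a hedged base-R sketch using stats::varimax on an unrotated loadings matrix built from the mtcars data. Displayr applies the rotation internally, so this only illustrates the idea, not Displayr's implementation.

[sourcecode language="r"]
# A base-R illustration of varimax rotation (stats::varimax), using mtcars.
# Displayr applies the rotation for you; this only shows the idea.
dat <- scale(mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")])
pca <- prcomp(dat)
raw.loadings <- pca$rotation[, 1:2] %*% diag(pca$sdev[1:2])  # unrotated loadings
rotated <- varimax(raw.loadings)
round(unclass(rotated$loadings), 2)  # loadings pushed towards 0 or +/-1
[/sourcecode]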

If you have a favorite rotation method to use, the Rotation method drop-down menu contains several other options. They are all described in mathematical terms, so discussing them here would not add much value if you don’t already have a preferred technique. In my experience, Varimax seems to be the most popular.

Saving variables

To use the results of the PCA in another analysis you need to save the variables to your data set. To do so:

  1. Have your PCA output selected on the page.
  2. From the menu select Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions. This will add the new variable set to the top of the Data Sets tree.
  3. (Optional) Right-click on the row labels in the variable set and Rename them, to make the components more recognizable.

Now, you can create a table from the component scores. The table will be full of 0s, indicating that the average score of each is zero. Don’t be alarmed! This occurs because the variables are standardized – with a mean of zero and a standard deviation of 1 – which is the standard technique. If you create a crosstab with another question, then the variation between variables will become more apparent. For instance, I renamed my components and created a table with the Age groups from the study:

PCA Components by Age

Rather unsurprisingly, the younger people have higher scores on the “Want technology” and “Cost-sensitivity” components, and a much lower score on the “Only use the basics” component.
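As a quick illustration of what "standardized" means here, the sketch below uses simulated scores and a made-up age-group variable (the names are hypothetical, not taken from the example study) to show the checks and the crosstab-style summary you would expect.

[sourcecode language="r"]
# Simulated scores and a made-up age-group variable, purely for illustration.
set.seed(42)
scores <- data.frame(want.technology = rnorm(100), only.use.basics = rnorm(100))
age.group <- sample(c("18-29", "30-49", "50+"), 100, replace = TRUE)

round(colMeans(scores), 2)       # standardized scores average ~0 overall
round(apply(scores, 2, sd), 2)   # ... with a standard deviation of ~1
aggregate(scores, by = list(Age = age.group), FUN = mean)  # mean score by age group
[/sourcecode]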

These new variables can be used just like any other in Displayr. Once you are happy with your new components, go back to the PCA output, and untick the Automatic box. This will prevent any changes to the components. If you modify your PCA later and change the number of components in the solution, you should delete the saved variables and run Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions again.

Hopefully, you find that Principal Components Analysis is easy to do in Displayr, and by saving the variables you can use it to complement your other analyses. Don’t forget the three main steps: set up your data correctly, create the analysis output, and use the output to save your new variables. Good luck and happy dimension reducing!

Learn More about Conjoint in Displayr
https://www.displayr.com/learn-more-about-conjoint-in-displayr/

Introduction

Conjoint Analysis: The Basics

Main Applications of Conjoint Analysis

Webinar: Introduction to Conjoint

 

Design

Experimental Design for Conjoint Analysis: Overview and Examples

Writing a Questionnaire for a Conjoint Analysis Study

Sample Size for Conjoint Analysis

Algorithms to Create your Choice Model Experimental Design

The Efficient Algorithm for Choice Model Experimental Designs

The Partial Profiles Algorithm for Experimental Designs

How to Create Alternative-Specific Choice Model Designs in Displayr

How to Set Up a Choice-Based Conjoint Analysis in Qualtrics

How Good is your Choice Model Experimental Design?

How to Check an Experimental Design (MaxDiff, Choice Modeling)

Webinar: How to Create Experimental Designs for Conjoint

 

Analysis

Formatting Data for Running Conjoint in Displayr

How to do Choice Modeling in Displayr

How to Use Hierarchical Bayes for Choice Modeling in Displayr

How to Use Simulated Data to Check Choice Model Experimental Designs Using Displayr

How to Analyze Dual-Response ‘None of These’ Conjoint Models in Displayr

How to Check a Choice-Based Conjoint Model

Testing Whether an Attribute Should be Numeric or Categorical in Conjoint Analysis

Numeric Attributes in Choice-Based Conjoint Analysis in Displayr

Numeric versus Categorical Price Attributes in Conjoint Analysis

Reordering Attribute Levels in Conjoint Analysis Models in Displayr

Comparing HB Root-likelihood (RLH) Between Displayr and Sawtooth

Checking Convergence When Using Hierarchical Bayes for Conjoint Analysis

Performing Conjoint Analysis Calculations with HB Draws (Iterations)

Comparing Choice Models and Creating Ensembles in Displayr

12 Techniques for Increasing the Accuracy of Forecasts from Conjoint Analysis

Understanding Logit Scaling

Computing Willingness-To-Pay (WTP) in Displayr

Webinar: Statistical Analysis for Conjoint

 

Visualizations

Data Visualization for Conjoint Analysis

Using Indifference Curves to Understand Tradeoffs in Conjoint Analysis

Using Substitution Maps to Understand Preferences in Conjoint Analysis

Creating Demand Curves Using Conjoint Studies

Webinar: Reporting for Conjoint

Webinar: Discover the Top Six Techniques of Pricing Research

 

Simulators

Creating Online Conjoint Analysis Choice Simulators Using Displayr

Adjusting Conjoint Analysis Simulators to Better Predict Market Share

Optimizing your Conjoint Analysis Simulator in Displayr

How to Create an Online Choice Simulator by Hand

Using Choice-Based Conjoint in Pricing Research Studies

Using the Value Equivalence Line (VEL) with Conjoint Simulators

Webinar: Reporting for Conjoint

Webinar: Discover the Top Six Techniques of Pricing Research

Case Study: Eggs Choice Simulator

Case Study: Fast Food Simulator

Optimizing your Conjoint Analysis Simulator in Displayr
https://www.displayr.com/conjoint-analysis-simulator-and-optimizer-in-displayr/

The choice simulator is one of the main objectives of choice-based conjoint analysis. This allows you to predict the effect of different scenarios on preference or market share. For this case study, we have used the cruise ship data set which Sawtooth supplied in their 2016 modeling competition. This post assumes you have already created your simulator using a conjoint model in Displayr.

Modifying your simulator

In our example, we have created a basic simulator with 3 alternatives:

simulator

There are various ways of modifying your simulator, including weighting and making adjustments to reflect market share. Here, we intend to add a combo box for filtering our preference shares by a specific survey question. We will use likelihood to travel in the next 10 years.

Connecting your simulator to a combo box

The best location for this combo box is on the Page Master which is accessible via Appearance > Page Master. This will allow the same control to appear on every page using this template while retaining the user selections.

We can create our own title page by selecting Title Only and pressing Home > Duplicate. We will rename this 'Page with Combo box'.

page master

To add the control item, go to Insert > Control > Combo Box. We then connect Items from to an existing table for likelihood to travel. Alternatively, you can just paste the label options separated by a semi-colon into Item list. In this case, we will also need to delete the default items there. Next, we change Selection Mode to Multiple selection. You can also optionally change the name under Properties > GENERAL.

We can now go back via Appearance > Normal and change the simulator page via Home > Layout > Page with Combo box.

With Displayr, you can easily filter data using a combo box with an R variable. As we are using a single-response question but wish to allow multiple selections, we need to first make it binary via Insert > Filter > Filters from Selected Data. You should then select the appropriate respondent data file under Data Sets and go to Insert > R > Numeric Variable. For a multiple-response combo box, the filter formula to use in the R CODE field is as follows:

rowSums(`Question_name`[, combo_box_name, drop = FALSE]) > 0

The Question name can simply be dragged over to this field from Data Sets to look like this:

rowSums(`Q3: How likely is it that you will take a cruise vacation sometime in the next 10 years? - Filters`[, Combo.box, drop = FALSE])> 0

This code will filter Q3 to the items selected in 'Combo.box'. It will then only include the respondents who fall into these categories.

Next, tick Usable as a filter. We will name this 'combo.filter'. Now you can go back to your simulator page and apply 'combo.filter' to your 'preference.shares' output under FILTERS & WEIGHT > Filter(s).

combo box filter

Below is the formatted version of our simulator:

formatted simulator

Weighting your data by alternative-specific respondent preference shares

Displayr allows you to complement your simulator with further visualizations that help tell the story of your data. One way to make further use of our simulator is to weight our demographic questions by a selected alternative's preference share results.

We will begin by making a new page with the same default combo box via Home > New Page > Page with Combo box. We will now copy the 'preference.shares' output from the simulator page via Home > Duplicate and drag it over to the new page to get the respondent-level results.

First, we need to remove the combo box filter from the output. We then need to paste the below code at the bottom of Properties > R CODE:

preferences.by.respondents = data.frame(matrix(resp.shares, ncol=3))
colnames(preferences.by.respondents) = c("Alternative 1","Alternative 2","Alternative 3") 
preferences.by.respondents

You will need to change the 'ncol' reference and column names to match the number of alternatives in your simulator.

The next steps involve creating the combo box filter. In the menu ribbon, select Insert > Control > Combo Box and paste Alternative 1; Alternative 2; Alternative 3 in Item list. I have named this combo box 'cCruise'.

Next, create the filter variable via Insert > R > Numeric Variable and paste the below into the R CODE field:

preferences.by.respondents[, cCruise]

This code will filter 'preferences.by.respondents' by the alternative number selected in 'cCruise'. Once you tick Usable as a weight, this can be applied to your outputs under Inputs > FILTERS & WEIGHT > Weight.

This allows you to add visualizations for various demographic questions with the combo box filter and weight applied to the source tables. Remember to drag the tables off the page and select Appearance > Hide. You can also use a variety of conjoint-specific visualizations, such as a demand curve for the price attribute.

preference share weighting

Creating an optimizer

An alternative to creating an online simulator is to create what we call an 'Optimizer'. Unlike a simulator, an optimizer allows multiple selections per attribute and generates multiple preference-share combinations at the same time.

To create an optimizer, you can either select your conjoint analysis output and click Inputs > SIMULATION > Create optimizer or go to Insert > More > Conjoint/Choice Modeling > Optimizer from the ribbon. You will need to then specify the number of alternatives and whether you wish to include alternative-specific attributes. We will choose 3 here and disregard the alternative attribute. This will create a page called 'Optimizer'.

optimizer

Similarly, we will also apply our combo box filter to the preference share output on this page.

Again, you can format the page objects as desired. In this case, an Autofit table provides more flexibility for the summary preference share table as you can easily drag the edges to align with the optimizer's columns. You can create this via Insert > Paste Table, ticking Autofit, and selecting this page's preference share output under DATA SOURCE.

Due to the varying size of the table, we can fix the height to ensure it adds a scroll bar. We will add row.height = "15px" to Properties > R CODE where the row-specific fields are.

autofit table

We can now select the original output, drag it off the page and press Appearance > Hide to ensure it remains hidden from the published version of the document.

Using your optimizer

One specific use case for the optimizer is fixing the options for the second and third alternatives while selecting multiple options for the first alternative. In the first column, we will select all the options under Room, Amenities, and Price to generate the 30 combinations for the multi-selected combo boxes.

formatted optimizer

A benefit of autofit tables for this scenario is we can automatically pre-sort the table from highest to lowest by the first column. Simply go to Inputs > ROW MANIPULATIONS, tick Sort rows, place '1' in Column used for sorting rows, and tick Sort in decreasing order.

You can see the finished document here.

Formatting Data for Running Conjoint in Displayr
https://www.displayr.com/formatting-data-for-running-conjoint-in-displayr/

There are many survey platforms that do not come with their own built-in choice-based conjoint question type. This then poses the question of how to set the data up so we can analyze it. We will now take you from the experimental design stage to your analysis stage while outlining the correct data structure.

Setting up your experimental design

In this simple conjoint example, we wish to look at the meat, sauce, and bun preferences for burgers. We have used Displayr to generate a balanced design via Insert > More > Conjoint/Choice Modeling > Experimental Design. This is based on 10 questions (or tasks) per respondent, 3 alternatives and 3 attributes per question, and 100 versions in total.

The options for the attributes (Meat, Sauce, Bun) have been pasted into the Add attributes and levels spreadsheet as below:

conjoint attributes

In the same menu, we can select Preview Choice Questionnaire to create a preview of your design. Below are the first 2 questions of version 1.

conjoint questionnaire

In the choice model output under Design, you will see the first 4 key columns: Version, Task, Question, and Alternative. The other columns represent the labels for the various items shown in each of the 3 attributes, Meat, Sauce, and Bun.

conjoint design

The 3 Alternatives correspond to the 3 columns in the questionnaire preview above, i.e., Option 1, Option 2, and Option 3. Question will remain within the fixed 1 to 10 range. Task, though, is cumulative, so the second version of the design will start from 11, the third from 21, and so on.

By simply clicking this output and selecting Export > Excel > Current Selection, you can export the design for programming in your survey platform.
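If you export the design and want to sanity-check it outside Displayr, a rough sketch along the following lines can help. It assumes a data frame named design with the Version, Task, Question, Alternative, and attribute columns described above; the file name and the use of the readxl package are only examples.

[sourcecode language="r"]
# Rough sanity checks on an exported design. Assumes a data frame named 'design'
# with the Version, Task, Question, Alternative and attribute columns described
# above; the file name and use of the readxl package are only examples.
# design <- readxl::read_excel("design.xlsx")
with(design, table(Version, Question))   # 3 rows (alternatives) per question, per version
table(design$Meat)                       # how often each Meat level appears overall
table(design$Meat, design$Alternative)   # Meat levels by position shown
[/sourcecode]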

If needed, you can also generate a numeric version of the design via Inputs > DIAGNOSTICS > Numeric design.

conjoint numeric design

If you are programming your survey in Qualtrics and have API access, you can export your design directly into your survey via the Export Design to Qualtrics option.

Setting up your respondent data

In preparation for data collection, you should pre-program the necessary variables into your survey. Below you will see the key conjoint fields for matching with our design. This is from the first 10 records of the burger survey. Here, we have recorded both the Version number and the task number for each Question. The Version number is sufficient, however, if the task order is unchanged from the design order.

key variables

The format of the Alternatives selected in the survey is one variable per task corresponding to the 3 columns on display. This should either be Numeric or Categorical to ensure they are read correctly. You can convert text variables by changing INPUTS > Structure to Numeric.

choice variables

If you have a 'None of these' option you will need to code this response as 0 or set Missing Values to Exclude from analyses. When importing labeled designs from Excel as a data set, you should, therefore, check the variable values. If you additionally ask a dual-response 'None of these' question after each task, you will also have the same number of 'Yes/No' questions.
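As a purely illustrative example of the recoding idea (the variable and label names below are hypothetical), the choice labels for one task could be converted to numeric codes, with 'None of these' coded as 0, using something like:

[sourcecode language="r"]
# Hypothetical recode of one task's text responses (names are made up),
# e.g., via Insert > R > Numeric Variable.
choice.labels <- c("Option 1", "Option 2", "None of these", "Option 3")
choice.numeric <- match(choice.labels, c("Option 1", "Option 2", "Option 3"))
choice.numeric[choice.labels == "None of these"] <- 0
choice.numeric  # 1 2 0 3
[/sourcecode]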

Selecting the correct source options for your model

Once data collection is complete, you will need to import the appropriate data files via Home > New Data Set. We will now go to Insert > More > Conjoint/Choice Modeling > Hierarchical Bayes to start analyzing our choice data.

conjoint analysis

Under RESPONDENT DATA > Data source, you will be able to select Choice and task variables in data set or Choice and version variables in data set. When using the first option, ensure the task variables are in the exact same order as the choice variables. This is the same for when using the Dual-response 'none' choice field to select the 'Yes/No' questions. The second option allows you to simply reference the version variable if no task variables are supplied.

To input the matching design, under EXPERIMENTAL DESIGN > Data source select the appropriate option. In this case, we would select Experimental design R output as the design was created in Displayr. There are also options for Sawtooth and JMP files, and Displayr Experiment questions. The default Data set option is for all other externally created conjoint designs. When using a 'data set' option, the appropriate variables should be placed in their respective Version, Task, and Attributes fields.

Apart from when using Sawtooth CHO files (saved as .txt), Experiment questions, and Displayr designs, the other source options require one data set for the design and one for the responses. If your data comes from Alchemer (formerly SurveyGizmo), you will instead have a conjoint data set and a respondent data set. By going to Insert > More > Conjoint/Choice Modeling > Convert Alchemer (Survey Gizmo) Conjoint Data for Analysis, Displayr will append the choice and version variables to the respondent data set, provided 'ResponseID' appears in both files. For the design component, the 'ResponseID' from the conjoint data set is used as Version and 'Set Number' as Task.

For many of these options, there will also be an Enter attribute levels button at the bottom. This uses the exact same format as when we set up the design in a previous step.

 

Get in touch if you have any questions, and see our blog to keep up on all the latest Displayr features.

Driver Analysis in Displayr
https://www.displayr.com/driver-analysis-in-displayr/

Displayr makes it both easy and fast to perform driver analysis. This post gives an overview of the key features in Displayr designed for performing driver analysis (i.e., working out the relative importance of predictors of brand performance, customer satisfaction, and NPS). It describes the various driver analysis methods available, stacking, options for missing data, in-built diagnostics for model checking and improvement, and how to create outputs from the driver analysis.

For more detail about what method to use when, see our driver analysis webinar and eBook.

Choice of driver analysis method

All the widely used methods for driver analysis are available in Displayr. They are accessed via the same menu option, so you can toggle between them.

  • Correlations: Insert > Regression > Driver analysis and set Output to Correlation. This method is appropriate when you are unconcerned about correlations between predictor variables.
  • Jaccard coefficient/index: Insert > Regression > Driver analysis and set Output to Jaccard Coefficient (note that Jaccard Coefficient is only available when Type is set to Linear). This is similar to correlation, except it is only appropriate when both the predictor and outcome variables are binary (see the sketch after this list).
  • Generalized Linear Models (GLMs), such as linear regression and binary logit, and the related quasi-GLM methods (e.g., ordered logit): Insert > Regression > Linear, Binary Logit, Ordered Logit, etc. These address correlations between the predictor variables, and each of the different methods is designed for different distributions of the outcome variable (e.g., linear for numeric outcome, binary logit for two-category outcome, ordered logit for ordinal output).
  • Shapley Regression: Insert > Regression > Driver analysis and set Output to Shapley Regression (note that Shapley Regression is only available when Type is set to Linear). This is a regularized regression, designed for situations where linear regression results are unreliable due to high correlations between predictors.
  • Johnson's relative weight: Insert > Regression > Driver analysis. Note that this appears as Output being set to Relative Importance Analysis. As with Shapley Regression, this is a regularized regression, but unlike Shapley it is applicable to all Type settings (e.g., ordered logit, binary logit).
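To make the first two metrics concrete, here is a small base-R sketch of what a correlation-based and a Jaccard-based importance score compute. The data is simulated and the predictor names are made up; this is not Displayr's implementation, just an illustration of the arithmetic.

[sourcecode language="r"]
# Simulated data and made-up predictor names; not Displayr's implementation.
set.seed(1)
outcome <- rbinom(200, 1, 0.5)                       # e.g., a binary outcome such as NPS promoter
predictors <- matrix(rbinom(200 * 3, 1, 0.4), ncol = 3,
                     dimnames = list(NULL, c("Service", "Price", "Range")))

cor(predictors, outcome)                             # correlation of each predictor with the outcome

# Jaccard coefficient: size of the intersection over the size of the union
apply(predictors, 2, function(x)
    sum(x == 1 & outcome == 1) / sum(x == 1 | outcome == 1))
[/sourcecode]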

Stacking

Often driver analysis is performed using data for multiple brands at the same time. Traditionally this is addressed by creating a new data file that stacks the data from each brand on top of each other (see What is Data Stacking?). However, when performing driver analysis in Displayr, the data can be automatically stacked by:

  • Checking the Stack data option.
  • Selecting variable sets for Outcome and Predictors that contains multiple variables (for Predictors these need to be set as Binary - Grid or Number - Grid).

Missing data

By default, all the driver analysis methods exclude all cases with missing data from their analysis (this occurs after any stacking has been performed). However, there are two additional Missing data options that can be relevant:

  • If using Correlation, Jaccard Coefficient, or Linear Regression, you can select Use partial data (pairwise correlations), in which case the data is analyzed using all the available data. Even when not all the predictors have data, the partial information is used for each case.
  • If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Multiple imputation can be used. This is generally the best method for dealing with missing data, except for situations where Dummy variable adjustment is appropriate.
  • If using Shapley Regression, Johnson's Relative Weights (Relative Importance Analysis), or any of the GLMs and quasi-GLMs, Dummy variable adjustment can be used. This method is appropriate when the data is missing because it cannot exist. For example, if the predictors are ratings of satisfaction with a bank's call centers, branches, and website, and data is missing for people who have never used one of these, then this setting is appropriate (see the sketch after this list). By contrast, if the data is missing because the person didn't feel like providing an answer, multiple imputation is preferable.
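To see why this setting only makes sense for "cannot exist" missingness, here is a minimal sketch of the dummy-variable-adjustment idea in plain R, using simulated data. It illustrates the general technique, not Displayr's internal implementation.

[sourcecode language="r"]
# Simulated data; an illustration of the general technique only.
set.seed(2)
dat <- data.frame(satisfaction = rnorm(100),
                  call.center = ifelse(runif(100) < 0.3, NA, rnorm(100)))

dat$no.call.center <- as.numeric(is.na(dat$call.center))  # indicator: rating cannot exist
dat$call.center[is.na(dat$call.center)] <- 0              # zero-fill the missing ratings

summary(lm(satisfaction ~ call.center + no.call.center, data = dat))
[/sourcecode]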

Diagnostics for model checking and improvement

A key feature of Displayr's driver analysis is that it contains many tools for automatically checking the data to see if there are problems, including VIFs and G-VIFs if there are highly correlated predictors, a test of heteroscedasticity, tests for outliers, and checks that the Type setting has been chosen correctly. Where Displayr identifies a serious issue, it will show an error. In other situations, it will show a warning (in orange) and provide suggestions for resolving the issue.

One particular diagnostic that sometimes stumps new users is that by default Displayr sometimes shows negative importance scores for Shapley Regression and Johnson's Relative Weights. As both methods are defined under the assumption that importance scores must be positive, the appearance of negative scores can cause some confusion. What's going on is that Displayr also performs a traditional multiple regression and shows the signs from this on the relative importance outputs as a warning for the user that the assumption of positive importance may not be correct. This can be turned off by checking Absolute importance scores.

Outputs

The standard output from all but the GLMs is a table like the one below. The second column of numbers shows the selected importance metric, and the first column shows this scaled to be out of 100.

Quad map

A key aspect of how driver analysis works in Displayr is that it can be hooked up directly to a scatterplot, thereby creating a quad map. See Creating Quad Maps in Displayr.

Crosstabs of importance scores

All the driver analysis methods have an option called Crosstab interaction, where a categorical variable can be selected, and the result is a crosstab that shows the importance scores by each unique value of the categorical variable, with bold showing significant differences and color-coding showing relativities.

Accessing the importance scores by code

The importance scores can also be accessed by code. For example, model.1$importance$raw.importance contains the raw importance scores, where model.1 is the name of the main driver analysis output.

This can then be used in other reporting. For example, when inserted via Insert > R Output, table.Q14.7[order(model.1$importance$raw.importance, decreasing = TRUE), ] sorts a table called table.Q14.7 by the importance scores, and paste(names(sort(model.1$importance$raw.importance, decreasing = TRUE)), collapse = "\n") creates a textbox containing the attributes sorted from most to least important.
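Gathering those snippets together, a single commented R Output might look like the following. It assumes, as in the examples above, a driver analysis output named model.1 and a table named table.Q14.7.

[sourcecode language="r"]
# Assumes a driver analysis output named model.1 and a table named table.Q14.7,
# as in the examples above (e.g., placed in an Insert > R Output).
imp <- model.1$importance$raw.importance                      # raw importance scores

table.Q14.7[order(imp, decreasing = TRUE), ]                  # table sorted from most to least important

paste(names(sort(imp, decreasing = TRUE)), collapse = "\n")   # attribute names, most to least important
[/sourcecode]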

Tools for Evaluating Segmentations
https://www.displayr.com/tools-for-evaluating-segmentations/

The easy bit of market segmentation is creating segments. The hard bit is working out whether the resulting segments are useful. In this post, I review several tools for making this process more efficient: heatmaps that summarize lots of crosstabs, segment comparison tables, smart tables, correspondence analysis, bubble charts designed for comparing segments, and automatic updating.

Recap: How to create segments

Creating segments consists of two steps. First, we find some data that describes key differences between people. Typically the best data is obtained from surveys, collecting data on attitudes and behavior.  The second step is to form the segments, where the options are to use:

  1. Pre-defined segments. For example, age, gender, family life stage, company size, or industry.
  2. Statistical techniques, such as cluster analysis, latent class analysis, and neural networks, which create new segmentations, where each segment consists of people that are similar based on the data being analyzed.
  3. Strategy, which involves a combination of using pre-defined segments and statistical techniques.

The beginner's mistake is to choose to create a single segmentation. The mistake usually starts from a misunderstanding, which is the mistaken belief that a market contains a small number of segments and the goal of market segmentation is to find them.

The expert move is to create many segmentations - typically dozens - and work out which of these is "best", where best means provides the most strategic advantage. The key insight behind this approach is that in just about all markets there is no small "natural" set of segments. Rather, there are an infinite number of ways of carving segments out of the market. The more segmentations that are evaluated, the better the likelihood that a good one is found.

Once you have created multiple segmentations to be compared, you should end up with a data file, which contains:

  • One or more segment membership variables. Each segmentation will be represented by a separate variable. That is, a single column of data, where each person has a number assigned to them (e.g., a 1 for the first segment, 2 for the second segment, etc.).
  • Profiling variables, describing key differences between people. Typically this will include things like their attitudes, behaviors, demographics, and media usage.

Evaluating segments and segmentations

And now we move onto the meat of this post. If we have multiple segmentations, how do we efficiently compare them? The traditional approach has been to create lots and lots of crosstabs, and read through them all. This is a slow and painful process, and if you go down this route the odds are you will only end up evaluating a small number of possible segmentations which is, as described in the previous section, the wrong route. However, there are a number of ways of short-cutting this.

Heatmap summarizing lots of crosstabs

One way of automating the process of inspecting lots of crosstabs is to create all the crosstabs, but then, rather than read them, create a heatmap that summarizes what they show. The heatmap below compares six alternative segmentations using 115 profiling variables. Each row of the heatmap shows the statistical significance of the crosstab between one profiling variable and each of the segmentations. I've represented the statistical significance using z-statistics, as they create a better visualization than p-values (a z of more than 1.96 corresponds to a p-value of 0.05 or less).

Looking at the heatmap below we can see, for example, that segmentation 1's segments differ more in terms of work status, occupation, and age, than do any of the other segmentations (i.e., the blue is darker, which means a higher z-score, suggesting a more significant relationship). The second segmentation better explains differences in top of mind awareness, the perception that brands are fashionable, etc.

You can find out more about how to create this heatmap by reading this post.
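For readers who want to see the underlying idea rather than the Displayr recipe in the linked post, here is a rough base-R sketch: run a chi-square test for each profiling-variable-by-segmentation crosstab, convert the p-values to z-statistics, and plot the matrix as a heatmap. The data below is simulated, and the details (test choice, plotting function) are simplifications.

[sourcecode language="r"]
# Simulated sketch of the idea: a chi-square test per profiling variable x
# segmentation crosstab, p-values converted to z-statistics, shown as a heatmap.
set.seed(3)
segmentations <- replicate(6, sample(1:4, 300, replace = TRUE))
profiling <- replicate(10, sample(1:3, 300, replace = TRUE))

p <- outer(1:10, 1:6, Vectorize(function(i, j)
    chisq.test(table(profiling[, i], segmentations[, j]))$p.value))
z <- qnorm(p / 2, lower.tail = FALSE)   # two-sided p-value to z-statistic

heatmap(z, Rowv = NA, Colv = NA, scale = "none",
        labCol = paste("Segmentation", 1:6))
[/sourcecode]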

Counting the number of significant differences

The chart below counts up the number of profiling variables that were statistically significant, as shown in the heatmap. It tells us that the 3rd and 5th segmentations are related to more of the profiling variables than the others, and should be the first segmentations we focus on. Additional insight can be obtained by doing similar analyses for sets of profiling variables (e.g., demographics, usage variables, etc.). And while you can count these in your head, it's just a line of magic code to automate the process of counting up the number of differences: colSums(most.significant.results > 1.96).

One big segment profile table

Once you have identified the key segmentations to focus on, the next step is to carefully examine each to understand how their segments differ.

This next time saver is one that took us a bit of time to add to the software. We kept getting the request, but for some reason didn't figure out why it was important. The basic idea of the table below is that it is a single table, rather than lots of tables, that summarizes the relationship between the segments in a segmentation and all the profiling variables. Now, you may be thinking "it's just a table", but it has a few special features:

  1. It shows the segment names and sizes at the top, and they stay pinned there as you scroll
  2. Shading is used to show the magnitude of differences between segments, and you can control this (e.g., I've used borders instead)
  3. Font color is used to show the results of significance tests
  4. It is showing both categorical and numeric data
  5. You can embed it on a web page to give stakeholders access to it
  6. You can set up filters so that clients can further drill into it

This is created using Insert/Create > Segments > Segment Comparison Table.

 

Smart tables

A practical problem with segment comparison tables is that, with a big study, they can just be too big to use. So, how should we select which variables to include? We should only include strategically interesting variables. But what if we have lots of these? The simplest approach is to use automated statistical tests to identify which variables to include. In Q this is done using Insert > Tables > Smart Tables. In Displayr, use Insert > More > Tables > Lots of Crosstabs and choose the options for deleting non-significant tables.

Correspondence analysis for better understanding a categorical profiling variable

The heatmap allows us to compare lots of segmentations. The segment comparison table gives us detailed information on a single segmentation. The next level of analysis is to get a lot of depth on a set of variables, or on the categories of a categorical variable. Correspondence analysis is often the best tool for this, as it draws our attention to the key differences between the segments. The example below shows us that Segments 3 and 4 skew to older people, and the others to younger people.

Bubble charts for comparing segmentations

If we have a single key numeric variable, we can compare it across multiple segmentations using a scatterplot like the one shown below, where the alternative segmentations appear on the x-axis, the key variable on the y-axis, and the sizes of the segments are shown by bubbles. We can see in this example that age seems to be the best of the alternative segmentations being compared, as it has the highest degree of discrimination and there are no huge segments. (The variation in the sizes of the age segments hints at some methodological problems, however...). See chapter 3 of our Brand Analytics ebook for instructions on how to create this visualization.
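If you want to rough out this kind of chart yourself, the sketch below uses base R with simulated stand-in data (the segment means and sizes are made up for illustration); in Displayr the visualization is built for you, as described in the ebook.

    # Stand-in data: one row per segment, grouped into three alternative segmentations.
    set.seed(123)
    d <- data.frame(segmentation = rep(1:3, times = c(4, 3, 5)),
                    mean.value   = runif(12, 3, 8),    # segment mean of the key variable
                    size         = runif(12, 0.05, 0.4))  # segment share of the sample
    plot(d$segmentation, d$mean.value, cex = sqrt(d$size) * 10,
         xlab = "Alternative segmentation", ylab = "Key variable (segment mean)")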

 

Lastly, don't forget automatic updating

Most importantly, both of our products, Q and Displayr, have the ability to automatically update charts and tables with new data. This means that you can create a detailed set of tables or visualizations describing segments, and then automatically populate these with alternative segmentations.

Improving Segmentations Using Within Variable and Within Case Scaling
https://www.displayr.com/improving-segmentations-using-within-variable-and-within-case-scaling/

This post describes how to apply the three standard ways of scaling (transforming) rating scales data prior to using cluster analysis and other algorithms for forming segments: standardizing within variable, standardizing within case, and unit scores within variable. The post starts with a discussion of the reasons for scaling, then reviews the three standard ways of scaling, then discusses some other common ways of transforming data prior to creating segments, and ends with a discussion of how to validate segmentations after scaling.

Why data is scaled

Segmentation algorithms, such as k-means, attempt to create groups that optimally summarize the strongest patterns in the data. However, sometimes the strongest patterns in the data are not very useful. In particular, three common problems are:

  • A small number of variables dominate the segmentation. For example, if one variable has a range of 11, and the others all have a range of 1, it's likely that differences on the variable with the large range will dominate the segmentation. This problem is addressed by scaling variables.
  • Patterns consistent with response biases are identified. For example, the segmentation below is based on ratings of how important people believe different things are to buyers of market research services. Two clusters are shown. The average for cluster 1 is higher on every single variable than for cluster 2. One explanation is that cluster 1 just regards everything as more important. The more likely explanation is that cluster 1 consists of people who have a tendency to give higher ratings (i.e., a yeah-saying bias). This problem is addressed by scaling within case.
  • The segments are just not interesting/useful, and there is a desire to obtain segments that are in some ways different. Some people new to segmentation are a bit surprised by this goal, as they often have a misunderstanding that segmentation is about "finding" segments in data. However, segmentation is more about "creating" segments. There are numerous arbitrary decisions that go into segmentation, and each reveals some different aspect of the underlying data. The goal is to find the segmentation that is most useful.

The three common ways of scaling

Standardizing data, in the context of clustering and other segmentation techniques, usually refers to changing the data so that it has a mean of 0 and a standard deviation of 1.

Standardizing within variable

The toy example below shows a data set containing 4 observations (cases) and three variables. Variable A will be ignored by any sensible segmentation algorithm, as it has no variation. But, all else being equal, a segmentation algorithm will be more likely to generate segments that differ on variable C than on variable B. This is because C has a higher variance (standard deviation) than B. A fix for this problem is to modify each variable so that they all have the same standard deviation. The simplest way of doing this is to subtract the mean from each variable and divide by the standard deviation, as done on the right. In Displayr, this is done by selecting the variables or variable set and clicking Object Inspector > ACTIONS > Scale within variable - standardize or by selecting this option from Insert > Transform.

Note that after this scaling variable A contains entirely missing data, and needs to be excluded from any segmentation.
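For readers working outside Displayr, here is a minimal sketch of the same idea in R. The numbers are illustrative, not the exact toy data from the post:

    # Standardize within variable: each column is rescaled to mean 0, sd 1.
    x <- data.frame(A = c(1, 1, 1, 1),   # no variation, like variable A above
                    B = c(2, 1, 1, 1),
                    C = c(6, 2, 4, 7))
    x.std <- scale(x)   # (value - column mean) / column standard deviation
    x.std               # column A becomes NaN throughout, so it must be excluded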

Standardizing within case

Both before and after standardizing within variable, variables B and C are highly correlated (the standardization does not affect their correlation). When most of the variables are highly correlated it guarantees that the resulting segments will primarily differ in terms of their average values. Standardizing within case means scaling each case (row) of the raw data so that it has a mean of 0 and a standard deviation of 1. In Displayr, this is done by selecting the variables or variable set and clicking Object Inspector > ACTIONS > Scale within case - standardize or by selecting this option from Insert > Transform.

Compare the data for cases 3 and 4. In the original data, case 4 has values of 1 and 7, whereas case 3 has values of 1 and 4. After the scaling, cases 3 and 4 are now identical. Also note that variable A previously contained no information, but it now does contain variation, as case 2's score of 1 on A is, by this case's standards, nowhere near as low a score as it is for the other cases.
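A minimal sketch of within-case standardization in R, continuing with the illustrative data frame from the previous sketch:

    # Standardize within case: each row (respondent) is rescaled to mean 0, sd 1,
    # which removes individual response-style effects such as yeah-saying.
    x.case <- t(scale(t(as.matrix(x))))   # transpose so scale() operates on rows
    round(x.case, 2)                      # rows 3 and 4 are now identical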

The output below shows the two cluster solution for the market researcher data after standardizing within case. The yeah-saying bias has been removed.

Unit scaling within variables

An alternative to standardizing within variables is to scale the data to have a unit scale, which means a minimum value of 0 and a maximum value of 1. This form of scaling is most useful when the input data has different scales (e.g., some variables may be on 2-point scales and others on 9-point scales). In Displayr, this is done by selecting the variables or variable set and clicking Object Inspector > ACTIONS > Scale within variable - unit or by selecting this option from Insert > Transform.
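And a minimal sketch of unit scaling in R, again using the illustrative data frame from above:

    # Unit scaling within variable: rescale each column to lie between 0 and 1.
    unit <- function(v) (v - min(v)) / (max(v) - min(v))
    x.unit <- sapply(x[c("B", "C")], unit)   # A is skipped: max == min would divide by zero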

Other scalings

Other transformations

Any transformation of variables can be used as a way of preparing data prior to using a segmentation algorithm, such as logarithms, rankings, square roots, and top 2 boxes, to name a few. These are available in Displayr via Insert > Transform and by clicking Object Inspector > ACTIONS.
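If you prefer to experiment in R, the sketch below shows hedged versions of a few of these transforms applied to a hypothetical 10-point rating variable:

    # Other common pre-segmentation transforms (illustrative variable, not survey data).
    v <- c(3, 9, 10, 6, 7, 2, 8)
    log(v)               # logarithm (requires positive values)
    rank(v)              # ranking within the variable
    sqrt(v)              # square root
    as.integer(v >= 9)   # top-2-box recoding on a 10-point scale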

Dimension reduction

Another popular approach to scaling data is to use dimension reduction techniques such as principal component analysis/factor analysis, and multiple correspondence analysis.
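As a rough illustration of the idea, the sketch below (using simulated stand-in data, not the data discussed in this post) replaces the raw variables with principal component scores before clustering:

    # Dimension reduction as pre-processing: cluster on component scores
    # rather than on the raw (or scaled) variables.
    set.seed(123)
    ratings <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)   # stand-in rating data
    pca     <- prcomp(ratings, center = TRUE, scale. = TRUE)
    scores  <- pca$x[, 1:3]                      # keep, say, the first three components
    seg     <- kmeans(scores, centers = 4)$cluster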

Multiple scalings

It is not unknown to apply multiple scalings in sequence. For example, first standardizing within case and then within variable, or, the other way around.

Validating scalings

Ultimately, any approach to scaling the data is arbitrary and, as such, it may improve things or make them worse.

Evaluating differences with the raw data

The most fundamental check is to examine the resulting segmentation using the original, untransformed, variables. The first two columns of data in the table below just repeat the initial segmentation from near the beginning of this post. The columns on the right demonstrate that the segments formed using data standardized within case are different even when compared using the original data. This is important both as a check of validity and for reporting (as showing results using scaled variables is a sure-fire way of confusing end-users of a segmentation).

Comparing to other data

A second way of validating the scaling is to check that the segments are correlated with other variables. For example, the segments formed with the standardized data do predict differences in whether somebody is a research provider or not, which demonstrates that the segments are not merely "noise".
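For those validating outside Displayr, a minimal sketch of this kind of check in R, using simulated stand-in data rather than the survey discussed here, looks like this:

    # External-validity check: crosstab the segments against a variable that was
    # not used to form them and test for association.
    set.seed(123)
    seg      <- sample(1:4, 200, replace = TRUE)                      # stand-in segment memberships
    provider <- sample(c("Provider", "Client"), 200, replace = TRUE)  # stand-in external variable
    chisq.test(table(seg, provider))   # a small p-value suggests the segments are not just noise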

General usefulness

The last key consideration when evaluating a scaling is the most important: is the resulting segmentation useful for the purpose for which it has been created?

For more information about how to perform segmentation, see our webinar and our eBook.

Automatic Language Translation of Text Variables
https://www.displayr.com/automatic-language-translation-of-text-variables/

In the screenshot below, for example, a text variable called Multilingual has been selected, and the button for Language translation then appears on the right side of the screen. When this button is pushed, a dialog box appears asking you what language the input text is in, and what language you want it translated into. If you have an input variable with multiple languages you can supply a variable indicating language, enabling multiple languages to be translated at the same time.

Automatic Language Translation with Automatic Categorization
https://www.displayr.com/automatic-language-translation-with-automatic-categorization/

Our automatic categorization tool (Insert > Text Analysis > Automatic Categorization > Unstructured Text) now has the ability to translate both the input text and the resulting outputs. So, if you have data in a language you can't understand, you can still get the text data automatically categorized, with the results returned in the language of your choice.

The translation is performed using Google Cloud Translation, and you manually specify the Source language. You can specify multiple source languages by providing a variable, as in the example below.

You can also set the Output language to any major language group:

Principal Component Analysis of Text Data
https://www.displayr.com/principal-component-analysis-of-text-data/

Worked example: Understanding attitude towards Tom Cruise

This post analyzes text data where people have listed their reasons for not liking Tom Cruise. The raw data is shown in the table below.

One component

By default, Displayr creates a PCA with two components, but to explain the technique I'm going to start by looking at the result with a single component. With one component, the PCA of text data seeks to find a single numeric variable that best explains differences in text.

The table of loadings below shows the correlation of different words and phrases with the numeric variable that describes the text. The way to read it is as follows:

  • The strongest correlation is for people that have used the word nothing (or a synonym) anywhere in their text.
  • The slightly weaker correlation for Exclusive: nothing is for people who mentioned nothing but didn't mention it as part of a bigram (a pair of words that commonly appear together).
  • Stem: not is the correlation of the word not and any words that commence with not (e.g., not, nothing) with the numeric variable.
  • nchars is the number of characters in the text. As it has a negative correlation it means that the more somebody typed, the lower their score on the variable that has been identified.
  • The first component is negatively correlated with Negative sentiment (i.e., the higher somebody's score on this variable, the lower their negative sentiment, so high scores correspond to relatively positive responses).

Putting all the results together tells us that if we have to summarize the text data as a single numeric variable, that variable measures whether they said Nothing on one end of the continuum, or didn't say nothing on the other.

The table below shows the numeric variable that has been computed. We can see, for example, that respondent 10 has said nothing and has a relatively high score (2.3). Respondent 1's answer isn't purely Nothing, which is why their score is closer to 0 (the average). By contrast, respondents who didn't write nothing have negative scores.

Two components

The table below shows the loadings from the two-component solution. The first component has essentially the same meaning as in the first analysis. But, if you scroll down, you will see that the second component measures whether or not somebody mentioned tom cruise (note the negative correlation). At one end of this component are people who mentioned Tom Cruise and like; at the other end are people who mentioned neither.

Four components

When we look at the four component solution, we end up with four variables that have the following interpretation:

  • First component variable - whether the text said nothing or one of the variants described for the one-component solution above.
  • Second component variable - whether the text mentions like or actor.
  • Third component variable - whether the text has Stem: scientolog (i.e., scientology or scientologist and any misspellings beginning with scientolog). Words that are synonyms of faith are also positively correlated with this variable.
  • Fourth component variable - whether the text does not mention crazy.


The table below shows the raw values of the four variables, sorted by the fourth variable (lowest to highest). We can easily see here that the further a value is below zero on the fourth variable, the more likely the respondent was to reveal that they regarded Tom Cruise as being crazy.

This analysis is useful in its own right, as a summary of the key trends in the data. And, the variables can be used as inputs into other analyses, such as cluster analysis or latent class analysis (segmentation).

Selecting the number of components

How many components should you have? This is likely best determined by judgment. Choose the number which leads to a result that makes sense.

An alternative is a scree plot. The basic idea is that you imagine that the plot is showing an arm, and you want to have the number of components that occurs at around the "elbow". In this example we have a double jointed elbow, so the plot at best tells us that 10 or fewer components is appropriate. As mentioned in the previous paragraph, my recommendation is to just use judgment.

One common heuristic for selecting the number of components is to use the Kaiser rule (eigenvalues > 1). Such rules aren't practical when using PCA for text data. This is because the PCA has 512 dimensions, and pretty much any traditional heuristic for determining the number of dimensions will recommend too many dimensions (e.g., with this example, the Kaiser rule suggests 81 components).
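To see why, here is a minimal R sketch of the Kaiser rule; the pca object is a stand-in prcomp() fit, not the actual text PCA from this post:

    # Kaiser rule: count components with eigenvalue greater than 1. With hundreds
    # of embedding dimensions this keeps far too many components (81 in the post's
    # example), which is why judgment, or the scree plot's elbow, is preferred.
    pca <- prcomp(matrix(rnorm(300 * 50), 300, 50))   # stand-in fit; use your own PCA here
    eigenvalues <- pca$sdev^2
    sum(eigenvalues > 1)              # the Kaiser rule count
    plot(eigenvalues, type = "b")     # scree plot: look for the "elbow"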

Instructions for conducting principal component analysis of text data

  • To conduct the analysis in:
    • Displayr: Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
    • Q: Create > Text Analysis > Advanced > Principal Components Analysis (Text)
  • Set the text variable in the Variable field.
  • Specify the desired Number of components.
  • Press ACTIONS > Save variables to save the variables to the data file.

How it works

  • The text data is cleaned
  • If necessary it is translated into English
  • It is converted into 512 numeric variables using Google's Universal Sentence Encoder for English.
  • A PCA is performed on the 512 numeric variables and the scores are extracted
  • A term-document matrix is created from the cleaned text data, along with sentiment analysis, and some related variables.
  • The loadings are computed as the cross-correlation matrix of the term-document matrix (rows) and the PCA scores (columns).
  • A varimax type rotation is applied to the loadings.
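For readers who want a feel for the mechanics, here is a hedged outline of that pipeline using generic R tools. It is not Displayr's internal code: embed_text() is a hypothetical stand-in for the Universal Sentence Encoder, and cleaned.text and term.document.matrix are assumed to already exist.

    # Hedged outline only; embed_text() is a hypothetical helper returning one
    # 512-dimensional numeric row per response.
    embeddings <- embed_text(cleaned.text)          # n x 512 numeric matrix (assumed helper)
    pca        <- prcomp(embeddings)                # PCA on the embedding variables
    scores     <- pca$x[, 1:4]                      # e.g., the four-component solution
    tdm        <- as.matrix(term.document.matrix)   # n x terms indicator matrix (assumed)
    loadings   <- cor(tdm, scores)                  # the cross-correlations shown in the loadings tables
    rotated    <- varimax(loadings)$loadings        # varimax-type rotation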
Automatic Removal of Outliers from Regression and GLMs
https://www.displayr.com/automatic-removal-of-outliers-from-regression-and-glms/

A well-known problem with linear regression, binary logit, ordered logit, and other GLMs is that a small number of rogue observations can cause the results to be misleading. For example, with data on income, where people are meant to write their income in dollars, one person may write their income as 50, meaning $50,000, while a billionaire may include their vastly larger income as-is. In this post I describe how you can automatically check for, and correct, such problems in your data. Such rogue observations go by various names, such as outliers and influential observations.

How to detect rogue observations

There are two basic stages of detecting rogue observations. The first is to create and inspect summary plots and tables of your data prior to fitting a model. The second is to use automatic tests that check to see if there are any observations that, when deleted from the data used to fit the model, cause the conclusions drawn from the model to change.

In Displayr and Q, various standard techniques are used to see if there are any rogue observations. If detected, they appear as warnings, like the one shown below. If you are new to statistics, the warnings can be a bit scary at first. Sorry! But do take the time to process them; once you get over the scariness, you will grow to appreciate that they are structured in a useful way.

The first thing to note is that one reason the warnings are scary is that they are written in very precise language. Rather than say "yo, look here, we've got some rogue observations", they use the correct statistical jargon, which in this case is that the rogue observations are influential observations. The warning refers to hat values, which is another statistical term for an observation's contribution (leverage) to the final regression estimates. Further, it describes exactly how these hat values have been defined, so that the warning can be reconciled with a textbook if you want to check. Most importantly, it gives you a solution, which in this case is to re-run the analysis using automated outlier removal.

Automated outlier removal

Below the warnings, you will find an option for setting the Automated outlier removal percentage. By default, this is set to 0. But, we can increase this percentage and remove the most outlying observations (based on studentized residuals for unweighted models and Pearson residuals for weighted models).
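To make the logic concrete, here is a minimal R sketch of this kind of removal rule, using simulated stand-in data; it is an illustration of the idea, not Displayr's exact implementation:

    # Rank observations by absolute studentized residual and refit the model
    # without the most extreme percentage.
    set.seed(123)
    dat  <- data.frame(y = rnorm(200), x1 = rnorm(200), x2 = rnorm(200))  # stand-in data
    fit  <- lm(y ~ x1 + x2, data = dat)
    res  <- abs(rstudent(fit))                 # studentized residuals (unweighted model)
    keep <- res <= quantile(res, 0.99)         # an automated outlier removal percentage of 1%
    fit2 <- lm(y ~ x1 + x2, data = dat[keep, ])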

There is no magical rule for determining the optimal percentage to remove (if there was we would have automated it). Instead, you need to make judgments, trading off the following:

  • The more observations you remove, the less the model represents the entire dataset. So, start by removing a small percentage (e.g., 1%).
  • Does the warning disappear? If you can remove, say, 10% of the observations and the warning disappears, that may be a good thing. But, it is possible that you always get warnings. It's important to appreciate that the warnings are designed to alert you to situations where rogue observations are potentially causing a detectable change in conclusions. But, often this change can be so small as to be trivial.
  • How much do the key conclusions change? If they do change a lot, you need to consider inspecting the raw data and working out why the observations are rogue (i.e., is there a data integrity issue?).

As an example, the scatterplot below shows the importance scores estimated for two Shapley Regressions, one based on the entire data set, and another based on 20% of observations being removed. With both regressions there are warnings regarding influential observations. However, we can see that while there are differences between the conclusions of the models (the estimated importance scores would otherwise be in a perfectly straight line), the differences are, in the overall scheme of things, trivial and irrelevant, giving us some confidence that we can ignore the outliers and use the model without any outlier removal.

 

Efficient Coding of Spontaneous Awareness Data in Q and Displayr
https://www.displayr.com/efficient-coding-of-spontaneous-awareness-data-in-q-and-displayr/

This post discusses the two approaches to efficient coding of spontaneous awareness data in Q and Displayr, and when to use which. While the example focuses on spontaneous awareness, it applies to any situation where there is a need to categorize lists of text data (e.g., product purchase, occasions).

A spontaneous awareness question is an open-ended question that asks respondents to name the first brands that come to mind associated with particular products or services. An example of a spontaneous awareness question is When you think of cell phone companies, which ones come to mind?. Respondents completing the questionnaire type their responses into an open-ended text box or boxes. Brand awareness is considered to be an influential predictor of how customers make choices when purchasing brands and services, and top of mind awareness (also known as share of mind) is measured by the number of times a brand or service is mentioned first.

An example of such data is shown below. This table aptly illustrates the two key aspects of spontaneous awareness data:

  1. There is a lot of repetition in the data, which means its analysis is amenable to automation.
  2. There are many inconsistencies in the way people write and the language they use. For example, at n t, Att, att, at and t, and AT&T. Because of these inconsistencies, the data cannot be automatically tabulated. There is a need to deal with all the variations.

Ways to collect spontaneous awareness data

The smart way: multiple text boxes

Giving respondents 10 boxes to enter brands, not one, generally results in respondents entering a single brand per box. This, in turn, makes the task of coding a survey easier and simplifies the process of automatically categorizing the data. This is apparent in the table above with each cell containing just a single brand.

The foolhardy way: a single text box

The more traditional (foolhardy) way to collect spontaneous awareness data is to give the survey respondent a single text box to fill in their open-ended answers. This allows each respondent to type in their response, choosing whatever delimiter they wish. A human being with knowledge of the cell phone market can discern that the first respondent mentioned five brands, as seen below. However, getting a computer to work this out is considerably harder.

xfinitiy spring t mobile at n t cricket
Verizon
Apple and Samsung
T-Mobile
verizon
Verizon
Apple, samsun, lf, lenovo, huawai; noki and one plus
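To see why, here is a minimal R sketch of the naive approach of splitting on common delimiters; it quickly runs into trouble with multi-word brands and inconsistent separators, which is exactly what the machine-learning-based tools described below are designed to handle:

    # Naive delimiter splitting mangles multi-word brands like "t mobile" and
    # leaves misspellings such as "xfinitiy" untouched.
    responses <- c("xfinitiy spring t mobile at n t cricket",
                   "Apple, samsun, lf, lenovo, huawai; noki and one plus")
    strsplit(tolower(responses), ",|;| and | ")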

Displayr and Q's two tools for coding spontaneous awareness data

Displayr and Q each contain two distinct ways of coding spontaneous awareness data.

  1. Manual coding - While this sounds onerous, it is usually the fastest approach if the data has been collected using multiple text boxes. This is because Q and Displayr automatically code any terms they have seen before, so once you have allocated each of the common misspellings of AT&T, all future appearances will automatically be categorized as well.
  2. Automatic coding - This is the best approach in three situations:
    • When the data has been collected the traditional way (i.e., a single text box). Automatic coding uses machine learning techniques designed to deal with the different delimiters.
    • You are in a massive rush. Automatic coding will get the job basically right most of the time with no human intervention at all.
    • You want the coding process to be 100% automatic when new data is collected. That is, if somebody comes up with a completely new way of mis-spelling AT&T (e.g., AT@t), you want it to be automatically categorized with no human intervention.

Manual coding

We've got lots of documentation about how to do this, so I won't repeat it here. Please see Manually Coding Multiple Response Text Data in Displayr and the Q wiki for more information.

Automatic coding of lists of items

Displayr and Q have a special tool designed for categorizing lists of items, such as brand names. In Displayr it is accessed using Insert > Text Analysis > Automatic Categorization > List of Items and in Q via Create > Text Analysis > Automatic Categorization > List of Items, and then selecting the Text Variables to be categorized. The output from carrying out the automatic coding is shown below. A few things to note:

  • The most common brand shown is Verizon. It appears 339 times, and the algorithm has automatically identified 9 different variants. If you move your mouse over Verizon, you will see all the variants.
  • The table to the right shows how the text has been changed and is sorted according to the degree of changes that the algorithm has made.
  • While the algorithm has been smart in working out that there are 11 variants of AT&T, it isn't psychic, and you can see it has created Att as a separate category, so we need to train it.


Merging categories

By expanding out the diagnostics section at the bottom of the table, you will see a group called Variant suggestions. Within this group are additional suggestions for further merging the data, in places where the algorithm thinks categories could potentially be combined.

To implement these suggestions, select the table by dragging with your mouse and press Ctrl-C to copy it. In the object inspector, click on REQUIRED CATEGORIES > Add required phrases and variants and paste (Ctrl-V) into the table. You can manually edit the phrases and variants to modify them further, or copy and paste them via Excel. List the categories to merge, with the name of the final category on the left, as in the example below:

When you click OK, the table on the left updates to show the frequency of the different brands.
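If you like to prototype the merging logic outside the user interface, a minimal R sketch of the same idea, using a hypothetical lookup from variants to required categories, looks like this:

    # Map known variants to a required category via a named lookup
    # (hypothetical variants and stand-in responses shown).
    merge.map <- c("att" = "AT&T", "at n t" = "AT&T", "at and t" = "AT&T",
                   "t mobile" = "T-Mobile", "tmobile" = "T-Mobile")
    brands  <- c("Att", "at n t", "T Mobile", "Verizon")   # stand-in coded responses
    cleaned <- tolower(trimws(brands))
    merged  <- ifelse(cleaned %in% names(merge.map), merge.map[cleaned], brands)
    table(merged)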

 

Saving as variables

When coding manually, variables are added to the data set automatically. When using automatic coding, it's necessary to click Insert > Text Analysis > Advanced > Save Variables > Categories.
