Fast track categorizing and coding text data

Overview

Displayr's text coding functionality is designed with the needs of the survey researcher front and center. For many years, the text categorization functions in Displayr have supported what we might call a manual workflow: we make it easy to view, sort, and filter text responses, create and structure categories, and assign (code) responses to those categories. More recently we've added semi-automated functions to the interface and extensively upgraded the algorithms that drive them. We believe our tools in this space are state-of-the-art:

  • When you select "Semi-Automatic" Text Categorization, you are immediately presented with a draft set of categories with the bulk of the data already coded, so in a matter of minutes you are off to a great start
  • The algorithms that create this output are based on analyzing context and meaning (not word similarity, like many other tools), so your draft code frames are intuitive from the get-go
  • We've made this work effectively for multiple response categorizations (where responses can be assigned to more than one code), which are historically more challenging to automate
  • For tracking and related research, we have specific algorithms that recognize and categorize unaided brand awareness questions
  • Once you have your draft categories, the user interface makes it easy to edit, with tools to combine, rename, and split categories

So the workflow now becomes:

  1. Let Displayr do the hard work and get you most of the way there (via a draft categorization) in a fraction of the time it would take manually.
  2. You then fine-tune and edit the categories via the intuitive user interface.

Accessing the automated functions

The quickest way to do this is to select a text variable in the Data Set tree, hover above or below it and click '+' to insert a new variable, then follow the prompts via the Semi-Automatic menu path:

 

 

We know some users might want to start the process manually. This could involve reading through some responses and creating some pre-planned categories. Even if you follow the Manual menu path, you can still access the Automatic categorization function, and at any time you can speed up the coding of the remaining uncategorized data. In the categorization interface, set "Sort by:" to Fuzzy match (as matching is a key building block of the algorithm), and the "Auto" button appears:

 

 

The functionality and workflow in action

Take a look at the process in action in this short video. It uses an open-ended question on how people feel about Tom Cruise as input*

You can get a broader overview of text analysis methods and solutions in this webinar recording: How to quickly analyze text data.

Streamline your text data analysis.

The process of turning open text responses into usable data is traditionally time-consuming and expensive (it is often outsourced). Displayr's text categorization tools are state of the art: you can create a draft categorization in minutes automatically and then quickly fine-tune it into a polished code frame. If you use a lot of text data and want to know more, book a demo or take a free trial.

 

*Discretion is advised - the data used in the video is from a real survey containing unvarnished attitudes to Tom Cruise. Some respondents have written unkind, distasteful, and potentially offensive things. Displayr does not condone or endorse any of the comments that have been made.

Save time translating and coding text data

While there are several ways to translate text data in Displayr, our text categorization function is made even more powerful by having Automatic Text Translation built into the interface. When you insert a new text categorization variable (+ > Text Categorization > .... > New - more on this in the video below), you are given the option to Translate the text:

 

 

Use any Source and Output language.

You are then prompted to select the Source language:

  • Automatically detect language
  • Specify with variable (use this option if the source language is identified by a variable in your data set. This option is particularly useful if your file contains multiple languages)
  • A specific language - the default language is English.

You can set the Output language here as well.

 

Text Translation in Action

Consider a simple scenario where hotel-review data has been collected and the survey includes a "comment" option - guests, of course, need to complete the survey in a language they are comfortable with:

  • The data file will have multiple languages, and (typically) an additional variable classifying the language selected.
  • The person responsible for categorizing (coding) the data will want to do so in their language
  • The outputs, being the categories (or code-frame), will also need to be in their language.

Displayr makes all this very easy, including dealing with multiple language inputs simultaneously. And once translated, you can create an initial categorization (code frame) automatically.

Take a look at the process in action in this short video, covering both single and multiple language translation ...

 

Streamline your text data translation and analysis.

The process of translating text data into the analyst's language is traditionally time-consuming and expensive (it is typically outsourced). Displayr's translation tools are now available directly in the text categorization interface - you can even create a draft code frame in your preferred language automatically. If you work with text data in other languages and want to know more, book a demo or take a free trial.

Learn More about Text Analysis in Displayr

Text Analysis in Displayr - General Resources

These are the best places to start to learn about text analysis in Displayr.

General Categorization (Coding) Resources

Automatic Categorization

Manual & Semi-automatic Categorization

Sentiment analysis

Word Clouds

Other Uses of Text Data

Manipulation of Text Variables

Automatic Language Translation of Text Variables

In the screenshot below, for example, a text variable called Multilingual has been selected, and the button for Language translation then appears on the right side of the screen. When this button is pushed, a dialog box appears asking you what language the input text is in, and what language you want it translated into. If you have an input variable with multiple languages you can supply a variable indicating language, enabling multiple languages to be translated at the same time.

Automatic Language Translation with Automatic Categorization

Our automatic categorization tool (Insert > Text Analysis > Automatic Categorization > Unstructured Text) now has the ability to translate both the input language and the resulting outputs. So, if you have data in a language you can't understand, you can still get the text data automatically categorized, with the results returned in the language of your choice.

The translation is performed using Google Cloud Translation, and you manually specify the Source language. You can specify multiple source languages by providing a variable, as in the example below.

You can also set the Output language to any major language group:
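For readers who want to script this kind of translation outside the interface, the hedged sketch below uses the Google Cloud Translation v2 client library. This is an assumption on our part (Displayr handles translation for you internally), it requires your own Google Cloud credentials, and the example strings are invented.

```python
# A minimal sketch of translating survey text with the Google Cloud
# Translation v2 client (not Displayr's internal code; requires credentials).
from google.cloud import translate_v2 as translate

client = translate.Client()

comments = ["Das Zimmer war sehr sauber", "La piscine était superbe"]

for comment in comments:
    detected = client.detect_language(comment)            # auto-detect the source language
    result = client.translate(comment,
                              source_language=detected["language"],
                              target_language="en")        # choose any output language
    print(detected["language"], "->", result["translatedText"])
```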

Principal Component Analysis of Text Data

Worked example: Understanding attitude towards Tom Cruise

This post analyzes text data where people have listed their reasons for not liking Tom Cruise. The raw data is shown in the table below.

One component

By default, Displayr creates a PCA with two components, but to explain the technique I'm going to start by looking at the result with a single component. With one component, the PCA of text data seeks to find a single numeric variable that best explains differences in text.

The table of loadings below shows the correlation of different words and phrases with the numeric variables that describe the text. The way to read it is as follows:

  • The strongest correlation is for people who have used the word nothing (or a synonym) anywhere in their text.
  • The slightly weaker correlation for Exclusive: nothing is for people who mentioned nothing, but didn't mention it as part of a bigram (a pair of words that appear commonly together).
  • Stem: not is the correlation of the word not and any words that commence with not (e.g., not, nothing) with the numeric variable.
  • nchars is the number of characters in the text. As it has a negative correlation, it means that the more somebody typed, the lower their score on the variable that has been identified.
  • The first component is negatively correlated with Negative sentiment (i.e., the higher the score, the lower the negative sentiment, so high scores on the variable correspond to positive sentiment).

Putting all the results together tells us that if we have to summarize the text data as a single numeric variable, that variable measures whether they said Nothing on one end of the continuum, or didn't say nothing on the other.

The table below shows the numeric variable that has been computed. We can see, for example, that respondent 10 has said  nothing and has a relatively high score (2.3). Respondent 1's answer isn't purely Nothing, which is why his score is closer to 0 (the average). By contrast, respondents who didn't write nothing have negative scores.

Two components

The table below shows the loadings from the two-component solution. The first component has essentially the same meaning as in the first analysis. But, if you scroll down, you will see that the second component measures whether or not somebody mentioned tom cruise (note the negative correlation). At one end, this component corresponds to mentioning Tom Cruise and like; at the other end, to mentioning neither.

Four components

When we look at the four component solution, we end up with four variables that have the following interpretation:

  • First component variable - whether the text said nothing or the similar variants described in the one-component analysis above.
  • Second component variable - whether the text mentions like or actor.
  • Third component variable - whether the text has Stem: scientolog (i.e., scientology or scientologist and any misspellings beginning with scientolog). Words that are synonymous with faith are also positively correlated with this variable.
  • Fourth component variable - not mentioning crazy.


The table below shows the raw values of the four variables, sorted by the fourth variable (lowest to highest). We can easily see that the further a value is below zero on the fourth variable, the more likely the respondent was to say they regarded Tom Cruise as being crazy.

This analysis is useful in its own right, as a summary of the key trends in the data. And, the variables can be used as inputs into other analyses, such as cluster analysis or latent class analysis (segmentation).

Selecting the number of components

How many components should you have? This is likely best determined by judgment. Choose the number which leads to a result that makes sense.

An alternative is a scree plot. The basic idea is that you imagine the plot is showing an arm, and you want the number of components that occurs at around the "elbow". In this example we have a double-jointed elbow, so the plot at best tells us that 10 or fewer components is appropriate. As mentioned in the previous paragraph, my recommendation is to just use judgment.

One common heuristic for selecting the number of components is to use the Kaiser rule (eigenvalues > 1). Such rules aren't practical when using PCA for text data. This is because the PCA has 512 dimensions, and pretty much any traditional heuristic for determining the number of dimensions will recommend too many dimensions (e.g., with this example, the Kaiser rule suggests 81 components).
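To make the point concrete, the sketch below runs a PCA on a stand-in 512-dimensional embedding matrix and counts how many components the Kaiser rule would keep. The data is randomly generated purely for illustration, so the exact counts will differ from the real example above.

```python
# A minimal sketch of why eigenvalue-based rules over-select components for
# 512-dimensional text embeddings. The `embeddings` matrix is simulated; in
# practice it would hold the encoded survey responses.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(900, 512))      # stand-in for ~900 encoded responses

pca = PCA().fit(embeddings)
eigenvalues = pca.explained_variance_

print("Components with eigenvalue > 1 (Kaiser rule):", int((eigenvalues > 1).sum()))
print("Cumulative variance of first 4 components:",
      round(float(pca.explained_variance_ratio_[:4].sum()), 3))
```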

Instructions for conducting principal component analysis of text data

  • To conduct the analysis in:
    • Displayr: Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
    • Q: Create > Text Analysis > Advanced > Principal Components Analysis (Text)
  • Set the text variable in the Variable field.
  • Specify the desired Number of components.
  • Press ACTIONS > Save variables to save the variables to the data file.

How it works

  • The text data is cleaned
  • If necessary it is translated into English
  • It is converted into 512 numeric variables using Google's Universal Sentence Encoder for English.
  • A PCA is performed on the 512 numeric variables and the scores are extracted
  • A term-document matrix is created from the cleaned text data, along with sentiment analysis, and some related variables.
  • The loadings are computed as the cross-correlation matrix of the term-document matrix (rows) and the PCA scores (columns).
  • A varimax type rotation is applied to the loadings.
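For readers who want to see the shape of this pipeline in code, here is a rough sketch using Google's publicly available Universal Sentence Encoder and scikit-learn. It is an illustration of the steps listed above, not Displayr's implementation; the example responses are invented and the varimax rotation step is omitted.

```python
# A rough sketch of the PCA-of-text pipeline (illustrative, not Displayr's code).
import numpy as np
import tensorflow_hub as hub
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer

texts = ["nothing", "he is a scientologist", "nothing really", "he seems crazy"]

# 1. Encode each response as a 512-dimensional vector.
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = np.asarray(encoder(texts))

# 2. PCA on the embeddings; keep the component scores.
n_components = 2
scores = PCA(n_components=n_components).fit_transform(embeddings)

# 3. Term-document matrix from the (lightly cleaned) text.
tdm_vectorizer = CountVectorizer()
tdm = tdm_vectorizer.fit_transform(texts).toarray()

# 4. Loadings: correlation of each term with each component score.
#    (Displayr also applies a varimax-style rotation, omitted here.)
def correlation(x, y):
    x = (x - x.mean()) / (x.std() + 1e-9)
    y = (y - y.mean()) / (y.std() + 1e-9)
    return float((x * y).mean())

terms = tdm_vectorizer.get_feature_names_out()
loadings = np.array([[correlation(tdm[:, t], scores[:, c])
                      for c in range(n_components)]
                     for t in range(len(terms))])
for term, row in zip(terms, loadings.round(2)):
    print(term, row)
```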
Efficient Coding of Spontaneous Awareness Data in Q and Displayr

This post discusses the two approaches to efficient coding of spontaneous awareness data in Q and Displayr, and when to use which. While the example focuses on spontaneous awareness, it applies to any situation where there is a need to categorize lists of text data (e.g., product purchase, occasions).

A spontaneous awareness question is an open-ended question that asks respondents to name the first brands that come to mind associated with particular products or services. An example of a spontaneous awareness question is When you think of cell phone companies, which ones come to mind? Respondents completing the questionnaire type their responses into an open-ended text box or boxes. Brand awareness is considered to be a strong predictor of how customers make choices when purchasing products and services, and top-of-mind awareness (also known as share of mind) is measured by the number of times a brand or service is mentioned first.

An example of such data is shown below. This table aptly illustrates the two key aspects of spontaneous awareness data:

  1. There is a lot of repetition in the data, which means its analysis is amenable to automation.
  2. There are many inconsistencies in the way people write and the language they use. For example, at n t, Att, att, at and t, and AT&T. Because of these inconsistencies, the data cannot be automatically tabulated as-is; all the variations need to be reconciled first (a sketch of one simple way to do this follows this list).
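The sketch below shows the flavor of that reconciliation step using simple edit-distance matching against a list of canonical brand names. It is only an illustration of the idea: the brand list and responses are invented, and Displayr's own algorithm is considerably more sophisticated.

```python
# A minimal sketch of mapping spelling variants onto canonical brand names
# using fuzzy string matching (illustrative only; not Displayr's algorithm).
from difflib import get_close_matches

canonical = ["AT&T", "Verizon", "T-Mobile", "Sprint", "Cricket"]
lookup = {c.lower(): c for c in canonical}

responses = ["at n t", "Att", "at and t", "Verison", "t mobile", "cricket"]

def normalize(text):
    match = get_close_matches(text.lower(), lookup.keys(), n=1, cutoff=0.5)
    return lookup[match[0]] if match else text   # leave unmatched text unchanged

for r in responses:
    print(f"{r!r:12} -> {normalize(r)}")
```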

Ways to collect spontaneous awareness data

The smart way: multiple text boxes

Giving respondents 10 boxes to enter brands, not one, generally results in respondents entering a single brand per box. This, in turn, makes the task of coding a survey easier and simplifies the process of automatically categorizing the data. This is apparent in the table above with each cell containing just a single brand.

The foolhardy way: a single text box

The more traditional (foolhardy) way to collect spontaneous awareness data is to give the survey respondent a single text box to fill in their open-ended answers. This allows each respondent to type in their response, choosing whatever delimiter they wish. A human being with knowledge of the cell phone market can easily discern the separate brands that the first respondent mentioned, as seen below. However, getting a computer to work this out is considerably harder.

xfinitiy spring t mobile at n t cricket
Verizon
Apple and Samsung
T-Mobile
verizon
Verizon
Apple, samsun, lf, lenovo, huawai; noki and one plus
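To see why the single-text-box format is harder to automate, consider the naive splitter sketched below (purely illustrative; not how Displayr parses responses). It copes with commas, semicolons, and the word "and", but the space-delimited first response comes back as one undivided string, which is exactly the case that needs machine learning to untangle.

```python
# A naive delimiter-based splitter for single-text-box awareness data.
# Illustrative only - note how it fails on the space-delimited first response.
import re

responses = [
    "xfinitiy spring t mobile at n t cricket",
    "Apple, samsun, lf, lenovo, huawai; noki and one plus",
    "Verizon",
]

def split_brands(text):
    parts = re.split(r",|;|\band\b", text)       # split on comma, semicolon, or "and"
    return [p.strip() for p in parts if p.strip()]

for r in responses:
    print(split_brands(r))
# The first response is returned as a single item, so a smarter,
# learning-based approach is needed to recover the individual brands.
```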

Displayr and Q's two tools for coding spontaneous awareness data

Displayr and Q each contain two distinct ways of coding spontaneous awareness data.

  1. Manual coding - While this sounds onerous, it is usually the fastest approach if the data has been collected using multiple text boxes. This is because Q and Displayr automatically code any terms they have seen before, so once you have allocated each of the common misspellings of AT&T, all future appearances will automatically be categorized as well.
  2. Automatic coding - This is the best approach in three situations:
    • When the data has been collected the traditional way (i.e., a single text box). Automatic coding uses machine learning techniques designed to deal with different delimiters.
    • You are in a massive rush. Automatic coding will get the job basically right most of the time with no human intervention at all.
    • You want the coding process to be 100% automatic when new data is collected. That is, if somebody comes up with a completely new way of misspelling AT&T (e.g., AT@t), you want it to be automatically categorized with no human intervention.

Manual coding

We've got lots of documentation about how to do this, so I won't repeat it here. Please see Manually Coding Multiple Response Text Data in Displayr and the Q wiki for more information.

Automatic coding of lists of items

Displayr and Q have a special tool designed for categorizing lists of items, such as brand names. In Displayr it is accessed using Insert > Text Analysis > Automatic Categorization > List of Items and in Q via Create > Text Analysis > Automatic Categorization > List of Items, and then selecting the Text Variables to be categorized. The output from carrying out the automatic coding is shown below. A few things to note:

  • The most common brand shown is Verizon. It appears 339 times, and the algorithm has automatically identified 9 different variants. If you move your mouse over Verizon, you will see all the variants.
  • The table to the right shows how the text has been changed and is sorted according to the degree of changes that the algorithm has made.
  • While the algorithm has been smart in working out that there are 11 variants of AT&T, it isn't psychic, and you can see it has created Att as a separate category, so we need to train it.


Merging categories

By expanding out the diagnostics section at the bottom of the table, you will see a group called Variant suggestions. Within this group are additional suggestions for categories that the algorithms think could potentially be merged further.

To implement these suggestions, select the table by dragging with your mouse and press Ctrl-C to copy it. In the object inspector, click on REQUIRED CATEGORIES > Add required phrases and variants and paste (Ctrl-V) into the table. You can manually edit the phrases and variants to modify them further, or copy and paste them via Excel. List the categories to merge, with the name of the final category on the left, as in the example below:

When you click OK, the table on the left updates to show the frequency of the different brands.

 

Saving as variables

When coding manually, variables are added to the data set automatically. When using automatic coding, it's necessary to click Insert > Text Analysis > Advanced > Save Variables > Categories.

Automatically Classify New Text Data Using an Existing Categorization

Fully automated text analysis can, sometimes, do a great job. However, the gold standard for automatic categorization is to first get a human being to manually "tag" the data, then use machine learning to automatically categorize new data. This saves tons of time, not only when you are working with preliminary data sets and trackers, but also when you just don't have time to manually categorize thousands of responses. Our Semi-Automatic Categorization feature makes manually categorizing responses faster. This post, by contrast, describes how to make our Automatic Categorization of Unstructured Text analysis smarter by feeding in some manually categorized data. For a more in-depth discussion of this and other approaches to automatically categorizing text, please see our Using Machine Learning to Automate Text Coding white paper.

Step 1. Create the categorization on a subset of the data

First, create a categorization using a subset of the data. This could be a random selection of the data or the first wave of a tracking study. For more information about how to do this, please read Manually Coding Multiple Response Text Data in Displayr and Semi-Automatic Coding of Text Data: A Cutting-Edge Technique. For this example, I used some responses about what people like about their cell phone provider. By simply searching and categorizing some keywords, I've managed to code a subset of 594 responses out of 895.

Step 2. Hook up your existing categorization to our automated text analysis

Next, we can automatically classify new responses as follows:

  1. Create the text analysis output using Insert > Text Analysis > Automatic Categorization > Unstructured Text
  2. In the Object Inspector on the Inputs tab, select your original text variable for Text variable and your manually categorized variable set for Existing categorization. I've done this for my Likes per the screenshot below:

Now, the analysis will calculate automatically. The output will be similar to the one below. You can see how many responses were automatically categorized by the model (the Predicted column) and how accurate those predictions are based on your originally categorized responses (the Accuracy column).


Note that not all of the responses may fit into one of the existing categories based on the machine learning. In that case, the response will not be categorized.
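Conceptually, this step works like the hedged sketch below: a classifier is trained on the manually coded subset, used to predict the remaining responses, and anything it cannot predict with enough confidence is left uncategorized. This is only an illustration of the idea (TF-IDF plus logistic regression on invented data), not the model Displayr actually fits.

```python
# Train on the manually coded subset, predict the rest, and leave
# low-confidence responses uncategorized. Illustrative sketch only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

coded_text = ["great coverage everywhere", "reliable network", "cheap plans", "good price"]
coded_labels = ["Coverage", "Coverage", "Price", "Price"]
uncoded_text = ["reliable coverage", "love the cheap price", "no comment really"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(coded_text, coded_labels)

probabilities = model.predict_proba(uncoded_text)
best = probabilities.argmax(axis=1)
confident = probabilities.max(axis=1) >= 0.6        # arbitrary threshold for the sketch

for text, idx, ok in zip(uncoded_text, best, confident):
    print(text, "->", model.classes_[idx] if ok else "UNCATEGORIZED")
```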

Step 3. Save your categories

Finally, you can save your categories into a variable set to use in tables and other outputs by selecting the automatic categorization output and clicking Insert > Text Analysis > Advanced > Save Variable(s) > Categories.

Using Text Data for Driver Analysis

A driver analysis is used to highlight the key drivers of performance. Traditionally, it uses quantitative data, where the outcome variable is often satisfaction, likelihood to recommend, or some other measure of interest. The predictors of the outcome are ratings or a multiple response question indicating the performance of the product(s) being analyzed. However, text data from open-ended questions, tweets, or some other data source are also useful predictors. In this post, I present an example looking at drivers of preference for US phone companies, and discuss a couple of traps to avoid.

The case study

The data is from a study of the US cell phone market collected by Qualtrics in July and August of 2019. I've used two questions for the analysis. The first is a quantitative question, which measures how likely people are to recommend their main phone brand. The second is qualitative, where people have listed what they like about their brand.

Prior to running the driver analysis I coded the open-ended data into the categories shown below. You can also use automated techniques for extracting key concepts from the data rather than manually coding it. However, in general this data is a bit noisier, so the resulting driver analysis may be less valid when using automated techniques.

Conducting the driver analysis

As we discuss in our eBook on driver analysis, normally with driver analysis it is good practice to use Johnson's Relative Weights or the near-identical Shapley Regression, as they both rescale the data and deal with multicollinearity. But in this case, there is a smarter approach, which is just to use good old-fashioned linear regression. What makes it smarter?

  • One of the key features of coded data is that some categories are bigger than others. In the table earlier in the post, 37% of people are categorized as Reliable/Coverage/Service, and only 2% as Speed. Using Johnson's Relative Weights or Shapley Regression will ensure that Reliable/Coverage/Service appears very important, but that Speed does not. We want the driver analysis to determine importance from the relationship between the predictors and the outcome, not the number of responses in each category.
  • When we use linear regression we can interpret the estimated coefficients as being differential impacts on NPS. The table below, for example, tells us that all else being equal, if a person likes their phone company due to Price, then their NPS score will be, on average, 18 points higher.
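A minimal sketch of this set-up is shown below, using simulated data: each coded category becomes a 0/1 predictor and the likelihood-to-recommend score is the outcome, so the fitted coefficients read directly as point differences in NPS. The numbers are invented and this is not the actual survey data.

```python
# Linear regression driver analysis with binary "reason mentioned" predictors.
# Data is simulated for illustration; coefficients are differential impacts on NPS.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "Price":            rng.integers(0, 2, n),
    "Coverage":         rng.integers(0, 2, n),
    "Customer service": rng.integers(0, 2, n),
})
# Simulate an NPS-style outcome where mentioning Price is "worth" ~18 points.
nps = -20 + 18 * X["Price"] + 30 * X["Coverage"] + 12 * X["Customer service"] \
      + rng.normal(0, 25, n)

model = sm.OLS(nps, sm.add_constant(X)).fit()
print(model.params.round(1))        # estimated point difference in NPS per category
```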

The table below shows the results of a linear regression. At first glance the regression seems to make sense. People who said they like Nothing have a much lower NPS, which is as we would expect. But, there is actually a problem here. The goal of driver analysis is to understand how experiences with the company influence attitude towards the company, where NPS is a measurement of that attitude. The categories of Nothing, I like them, and Everything aren't actually experiences at all. Rather, they are attitudes. So, the regression we have is meaningless, as it currently tells us that how much people like their cell phone carrier predicts their attitude to their cell phone carrier, which is tautological.

The solution to the tautology is to remove the predictors that are attitudes, which gives the model below. I've also removed Other as it is really a grab-bag of other things and thus uninterpretable.

Checking all the standard things

The next step is to do the standard tests of a regression model (e.g., checking outliers, multicollinearity, etc.). We discuss these in more detail in our eBook on driver analysis.

Putting it together as a quad map

The quad map below plots the importance scores (the Estimate column from above) on the x-axis and the performance (percentage of people to mention the issue) on the y-axis. In this case it delivers some great news: it identifies three opportunities for phone companies to differentiate themselves. The attributes of Speed, Payment arrangements, and Customer service are all in the bottom-right "quadrant". These are things that people find to be very important, but where the average phone company has low levels of performance, suggesting that if a phone company can persuade more people of its excellence in these areas it will improve its NPS.

Some traps to avoid

Performing driver analysis using text data can be a great win. But, I will finish off the post by pointing out a few traps that can catch the unwary. They all relate to inadvertently using inappropriate data:

  1. Data from people with a known attitude. Sometimes open-ended questions are only asked of people who gave a high (or low) rating. Unfortunately, such data is not suitable for a driver analysis. The whole point of driver analysis is to see how one thing (the text data) predicts another (the overall measure of preference). But, if we have only conducted the analysis among people who like their brand, then we have insufficient variation in their attitude to the brand to work out what causes it. The same problem exists if we have only collected text data from people known to dislike the brand.
  2. Using data from a Why did you say that? question. A second problem is where people were first asked their attitude, and then asked why did you say that. This is a problem because the actual meaning of this question is contextual. The person who said they really disliked the brand reads the question as why did you dislike the brand? whereas the person who likes the brand reads it as why do you like the brand? This means the text data is not comparable (e.g., if somebody says "price" it may mean the price is too high or too low).
  3. Using sentiment analysis on a How do you feel style question. In the case study I am using a rating of likelihood to recommend as the outcome variable. An alternative approach is to use an open-ended question and create an outcome variable by sentiment analysis. However, if doing this, some care is required, as it can easily be invalid. For example, let's say you asked How do you feel about Microsoft? Some people may respond by saying how much they like Microsoft. Other people may interpret this as an opportunity to describe what Microsoft is good at. A driver analysis of such data will be meaningless, as it will show that people who mention specific things (e.g., Microsoft is innovative) are less likely to give an attitude (e.g., I love Microsoft), as in effect they answered a different question, so we would end up with a driver analysis that tells us that being innovative is bad!
At Last, Machine Learning Can Accurately Categorize Text Data

5 Machine Learning Breakthroughs to Accurately Categorize Text!

 

For the last 20 years, the survey research industry has waited with bated breath for text analysis technologies to transform the way we analyze text data. In the last year or so, these technologies have reached a point where they can work with a high level of accuracy.

Five key advances have made it practical for computers to automatically code text with a high level of accuracy: the prediction-based human-computer interaction model, high-quality automated predictive algorithms, improved encoders, better analysis frameworks, and a focus on machine-understandable text.

1. The prediction-based human-computer interaction model

In the early days of artificial intelligence, the goal was to program if-this-then-that style rules. These were known as expert systems. All the early text analysis systems in market research worked the same way, with users effectively writing rules about how to categorize text data.

In the last two decades, we have recognized that part of the secret to effectively using AI is to always characterize it as a prediction problem. For example, if you're trying to translate text, rather than using dictionaries and language rules, the modern approach is to instead feed the algorithms large quantities of text that's been translated into two languages—building models that predict the text in one language based on the other.

In the case of text data from survey responses, the problem is then one of figuring out the best way to get data from the user that can be predicted. One way of doing this, which is very effective, is to get a person to manually categorize (code), say, 200 responses, train a machine-learning algorithm to predict this categorization, then use it to predict the categories of all the text data not already coded. In Displayr we refer to this as Automated Text Analysis. A variant of this is to have the user manually perform categorization, then have the machine-learning systems operate in the background while making recommendations to the user. These recommendations will get more and more accurate as the user categorizes more text.

As an example, the screenshot below is from data asking people why they dislike Tom Cruise. After coding only 20 unique responses, it's possible to build a surprisingly good model. The orange bars show the likelihood that a respondent is similar to the people who have been manually categorized as being egotistical.

Suggest feature in Text Analytics

2. High-quality automated predictive algorithms

Twenty years ago, the best practice for predictive modeling consisted of:

  • Having a highly-skilled statistician with a lot of relevant experience who understood things like heteroscedasticity, autocorrelation, interactions, transformations, family-wise error rates, basis functions, and how to address multicollinearity
  • Using generalized linear models
  • Careful selection of appropriate predictors

However, the second key breakthrough has been the development of fully-automated predictive models that, when left on their own, do a better job than the most skilled statistician with a generalized linear model. The most well-known example of this is the deep learning algorithms that capture the attention of the media, but for most of the types of problems faced by market researchers, random forest and xgboost usually work better.

3. Improved encoders

A predictive algorithm is only as good as the predictor variables, which is where the most recent innovations have occurred. Encoding refers to converting the text data into numbers.

Document-term-matrix

The first important encoding tool for text data was the document-term-matrix, which creates a variable for each term (word or phrase), recording whether it appears in each text response (each text response is referred to as a document in the text analysis literature). The early years of text analysis consisted of a whole host of improvements in the creation of document-term-matrices, such as dealing with spelling mistakes, synonyms, and the detection of phrases (n-grams, to use the jargon).

The example below shows a document-term-matrix for reasons for disliking Tom Cruise.
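As a concrete (and invented) illustration of what a document-term matrix looks like, the short sketch below builds one with scikit-learn, including bigrams.

```python
# Build a small document-term matrix (unigrams and bigrams) with scikit-learn.
# The responses are invented stand-ins for the survey data discussed above.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

responses = [
    "nothing",
    "he seems crazy",
    "scientology and he is crazy",
    "nothing at all",
]

vectorizer = CountVectorizer(ngram_range=(1, 2))
dtm = vectorizer.fit_transform(responses)

print(pd.DataFrame(dtm.toarray(), columns=vectorizer.get_feature_names_out()))
```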

Sentiment analysis

The next level of improvements in encoding consisted of attempting to extract meaning from text data, rather than just common words. In market research, this is most commonly associated with sentiment analysis, as shown in the example below.

Entity extraction

More sophisticated approaches to extracting sentiment from text data do so by attempting to understand the structure of the language (e.g., identify nouns, verbs, etc.). One useful application of this approach is entity extraction, which is illustrated in the example below.
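For a sense of what entity extraction produces, here is a hedged sketch using the open-source spaCy library. This is not the engine Displayr uses, and it assumes the small English model has been downloaded with `python -m spacy download en_core_web_sm`; the example sentence is invented.

```python
# Extract named entities (people, organizations, places) with spaCy.
# Illustrative only; assumes the en_core_web_sm model is installed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tom Cruise promoted Scientology at an event in Los Angeles.")

for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. PERSON, ORG or NORP, GPE
```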

Word embeddings

The next key innovation was Word embeddings. A word embedding represents a word by a set of coordinates (numbers). The two-dimensional embedding of the terms from the document-term-matrix above is shown below. Typically these are created using neural networks.
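To give a feel for what "coordinates for a word" means, the sketch below derives crude two-dimensional term coordinates from a document-term matrix via a truncated SVD (the classic latent-semantic-analysis trick). This is only a toy stand-in: production-quality word embeddings such as word2vec or GloVe are trained with neural networks on very large corpora, and the responses here are invented.

```python
# Crude 2-D "embeddings" for terms via truncated SVD of a document-term matrix.
# A toy illustration of coordinates-per-word, not a real neural embedding.
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

responses = ["nothing", "he is crazy", "crazy scientology stuff",
             "nothing at all", "scientology"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(responses)          # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)
svd.fit(dtm)

# Each term gets a pair of coordinates (one column per component).
coords = pd.DataFrame(svd.components_.T,
                      index=vectorizer.get_feature_names_out(),
                      columns=["dim_1", "dim_2"])
print(coords.round(2))
```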

Sentence embeddings

And then the final step, which was only cracked in the past couple of years (by the good folk at Google), was sentence encoding, where whole sentences are assigned coordinates.

4. Better analysis frameworks

In the early days, text analysis was viewed as a technique, just like, say, cluster analysis or conjoint. Many weird and wonderful visualizations were developed as a part of this approach. But, what's become much clearer in recent times is that this is the wrong way to think about text analysis. The better way to think about it is as a processing stage, which involves encoding text data as numbers. These numbers are then analyzed in the normal way, using standard (or awesome new) quant analysis tools.

The main implication of this for market research is a simple one: rather than attempting to create special-purpose word analyses, the goal is generally instead to use machine learning to replicate traditional coding, which is then analyzed using crosstabs.

5. Using machine-understandable text

The last breakthrough relates to making sure that the text data being analyzed can be understood by a machine. This one sounds really obvious when described, but it is quite profound in its implications.

There are two things we can do to ensure that we analyze data that is appropriate for text analysis tools:

  • Asking machine-understandable questions
  • Filtering data to make it consistent

Asking machine-understandable questions

Traditionally, open-ended questions have been written in a very broad way to avoid inadvertently restricting the type of data provided. For example, after asking people to rate how likely they are to recommend a company, the traditional follow up is something like "Why did you say that?". In the world of qualitative research that's not such a bad question. But, in the world of text analysis, it is horrible. Let's say somebody says "price." You can't interpret the data without knowing their response to the earlier question, and this information is typically not included in the text analysis.

If you know you are going to use automatic methods for text analysis, a much better approach is to split the question into two parts, such as "What do you like about your phone provider?" and "What do you dislike about your phone provider? ".

Or better yet, "Please list everything you like about your phone company" and providing the respondent with multiple separate boxes to enter their data into, so that you can more readily tease out differences between the responses. The more structure and the more effort put into ensuring that the responses are comparable, the better the job a computer will do in summarizing the data.

Filtering data to make it consistent

Consider data from a concept test, where you've asked people why they like each concept for 20 concepts, and the responses are all stored in a single text variable. The obvious thing to do is to attempt to perform text analysis on this variable. This can work, but only if there are consistent themes across all of the concepts. If the concepts and the types of answers given are very different, neither human nor machine will be able to extract themes. The fix is to instead perform the analysis for each concept separately (of course, you'd still want to try and do them all at once first up, as maybe they do have common themes).

The outcome

The outcome of these five innovations is that it is now possible to automatically code text data.

Semi-Automatic Coding of Text Data: A Cutting-Edge Technique

Manually coding text data into categories is one of the great pains of survey research. By contrast, many automatic text coding tools ease the pain of coding but don't do as good a job. As a result, people often just use word clouds to analyze text instead. However, a recent breakthrough in coding can give you the best of both worlds! Displayr's semi-automatic text analysis tool gives you the control to categorize text as you wish alongside automatic categorization capabilities. Better yet, as you categorize more text, the tool becomes smarter and can help you identify other relevant phrases and categories in your data.

In this post, I provide a walk-through of how to use Displayr's semi-automatic text analysis tool. While this example focuses on text that needs to be categorized into overlapping categories (i.e., multiple response coding, to use the market research industry jargon), the same basic approach can be used for creating mutually exclusive categorizations (i.e., single response coding).

The data

The data is from a study of the (mobile) cell phone market. We asked people what they liked about their main cell phone provider. The raw responses are shown in the table below.

sentiment data for a cell phone survey

Getting started

We start by adding a data set, selecting the text variable (it will have an a icon, as shown to the right in the Data Sets tree), and pressing Insert > Text Analysis > Semi-Automatic Categorization > Multiple Overlapping Categories > New.

Adding categories

Categories, known to market researchers as the code frame, are shown on the right side of the screen. By default, Displayr automatically starts with two categories:

1. Missing Data

2. New Category (You can rename New Category by right-clicking and selecting Rename.)

If you do not know what categories to create, you can read through the data to come up with your own categories or press the Suggest button. The first time you do this, you will have a long wait while Displayr builds some models in the background. After this initial run-through, though, it will be much faster!

To add new categories one at a time, right-click and select Add Category. To add categories in bulk, click Import/Export Category Names. The screenshot below shows the six initial categories that I created for this cell phone data set.

shows categorization frame and fuzzy match on Nothing

Performing a fuzzy sort

Type the name of the first category that you wish to identify in the Fuzzy sort on box and press Sort now. Displayr will think for a bit and sort the items in the list according to their similarity to the term. You'll notice in the screenshot above, I've done a fuzzy sort on the word Nothing (as in, there is "nothing" that the respondent likes about their phone company). The orange bars show how similar the words are to the search term. In the screenshot above, all the orange bars are the same length, as all the text we see precisely matches the word "nothing". You can see the orange bars become narrower in the screenshot below as results become less exact.

shows orange bars for non-exact results

Now, I'm sure your first instinct when you see that Everything is a fairly high result is that the algorithm is very dumb. But, what's really going on is a bit more subtle than this. Everything and Nothing are both quantities. They are similar constructs. While our human judgment tells us not to group them together, no computer can know for sure how to treat these without some training. This is the entire logic behind semi-automated text analysis! We progressively give the computer more information so that it can refine the models until they are accurate.
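Under the hood, a fuzzy sort of this kind amounts to ranking responses by their semantic similarity to the search term. The sketch below shows the general idea using Google's publicly available Universal Sentence Encoder; it is an illustration only (the responses are invented) and not Displayr's implementation.

```python
# Rank responses by cosine similarity to a search term using sentence encodings.
# Illustrative sketch only; the responses are invented.
import numpy as np
import tensorflow_hub as hub

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

responses = ["nothing", "nothin", "everything", "good coverage", "cheap plans"]
query = "Nothing"

vectors = np.asarray(encoder(responses + [query]))
response_vecs, query_vec = vectors[:-1], vectors[-1]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = sorted(((cosine(v, query_vec), r) for v, r in zip(response_vecs, responses)),
                reverse=True)
for score, response in scores:
    print(f"{score:.2f}  {response}")
```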

To help teach the algorithms within Displayr the difference between different terms, you simply categorize more text. For example:

  • Click on the category you wish to use on the right side of the screen. In this case Nothing.
  • Select the text responses you wish to categorize into that category. You can either:
    • Use the Shift and Ctrl keys on your keyboard to multi-select from your fuzzy sort.
    • Type a word or phrase into the Find box and exact matches to this will be selected.
  • Press the Add to button. This tags the responses with the category.
  • Repeat this process for the remaining categories.

Predicting similarity

When we use fuzzy sorts, Displayr is using encodings in the background (see At Last, Machine Learning Can Finally Categorize Text Data Accurately!). However, once we have put some data into multiple categories, machine learning models have sufficient data to be trained, which allows Displayr to provide more accurate suggestions.

In the screenshot below, I've allocated 13 unique responses to the Nothing category and 13 unique responses to the Everything category, and then chose Sort by as Similarity to Nothing (i.e., the Nothing category). Looking at the responses, you can see that the model's done a good job:

  • It's identified text responses where people have just failed to spell remotely correctly.
  • There is a clear drop-off in the orange bars between responses similar to Nothing and those that clearly don't belong.

shows similarity to search results

Hunting for missing categories

The most challenging aspect of using semi-automatic coding is ensuring that you have not inadvertently missed a key category. There are three strategies for doing this:

1. Read through, say, 200 responses and check that each has been categorized appropriately

This is the textbook approach to ensure that you have not missed a category.

2. Examine uncategorized data

Change Show responses from (at the top-left of the screenshot) to UNCATEGORIZED.

3. Suggest

This automatically hunts through the dataset to try and find clusters of similar words and proposes these as a new category name, which you can investigate by using the fuzzy sorting.

Refining the categorization

Once you have started coding some data, you may find that you need to change your categories or move text between them. Please see Manually Coding Multiple Response Text Data in Displayr for more information about how to do this.

Automatic updating with new data

Once you have performed a categorization semi-automatically once, you can then entirely re-use it automatically with new data. See Automatic Coding of Unstructured Text Data for more information about this.

To recap

You no longer have to choose between the pain of manual coding and the quality of your results. Both are achievable by using a few key features in Displayr's semi-automatic categorization tool:

  1. The Suggest button analyzes your text and recommends additional categories.
  2. Fuzzy sort on finds exact as well as non-exact matches to your search terms.
  3. Sort by Similarity to categories analyzes your coded text and tries to find uncoded text that is similar to responses in the category.
Automatic Coding of Unstructured Text Data

The data

The data is from a study of the (mobile) cell phone market. We asked people what they liked about their main cell phone provider. The raw responses are in the table below.

Fully automated text analysis (quick and dirty)

The quick-and-dirty approach to analyzing text data in Displayr is to:

  1. Go to Insert > Text Analysis (Analysis) > Automatic Categorization > Unstructured Text.
  2. In the Object Inspector (the section that opens on the right of the screen), under Inputs > Text variable select the variable that holds the text you want to analyze.
  3. Change the Inputs > Number of categories to the number of categories you would like to classify the data into.  I've chosen 10 for this example.

Object inspector

The output will calculate automatically (may take a little while to think), and looks like this:

The category names on the left are something of a pot luck. In this output you can see:

  • the automatically generated categories,
  • the proportions (center column),
  • counts of the number of cases, and
  • examples of the types of responses that have been allocated to each category.

We haven't yet cracked an algorithm that reliably gives human-understandable names. So, the secret is very much to look at the examples and to expand them out (clicking the grey triangle ▶) to see all the data.
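Conceptually, this kind of fully automatic categorization resembles clustering sentence encodings into a fixed number of groups and labelling each group by its most distinctive words. The sketch below shows that idea with scikit-learn; it is a rough stand-in, not the algorithm Displayr actually runs, and the responses are invented.

```python
# Cluster responses into a fixed number of categories via TF-IDF + k-means,
# then label each cluster by its top terms. A conceptual stand-in only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "great coverage", "coverage is reliable", "good network coverage",
    "cheap plans", "low price", "great price for the plan",
]
n_categories = 2

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(responses)

kmeans = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()

for k in range(n_categories):
    top = np.argsort(kmeans.cluster_centers_[k])[::-1][:3]
    members = [r for r, lab in zip(responses, kmeans.labels_) if lab == k]
    print(f"Category {k}: top terms = {[terms[i] for i in top]}, {len(members)} responses")
```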

From our experience and reports from our clients, fully-automatic text analysis can often give good insight. But it is not as good as doing manual coding. This is where the next approach comes in handy.

Automatic updating of text analyses (wide and awesome)

The gold-plated approach to efficiently doing automated text analyses is as follows:

  1. Manually or semi-automatically perform an analysis of, say, 300 text responses.
  2. Insert > Text Analysis > Automatic Categorization > Unstructured Text
  3. In the Object Inspector, set Existing categorization to the variable set that contains the manual or semi-automatic categorization.

Automatic updating

Displayr will then train a machine learning model using the existing categorization and predict the categories for the remaining text, often with extremely high accuracy.

Fully automated categorization (the first part of this article) only forms mutually exclusive categorizations. However, when you use it for automatic updating, as per this section, it also works for overlapping (multiple response) categorizations.

How to Save the Categories to your Data Set

You can easily save the categories assigned to your data so that you can use them in other analyses. Make sure that the output above is selected on the Page and then go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories. A new variable appears in your Data Sets called "Categories from..."

To create a simple example of a table that uses categorized data and another variable, drag the "Categories from" variable from the Data Sets pane onto a document Page. In the results below, I also crossed the automatically generated categories with the education level of the respondents (by placing it in the columns of the table).


Other automated approaches

The approach described above uses what we call unstructured text. Sometimes text data can have much more structure. We have two other tools designed for automatically categorizing such data:

  • Insert > Text Analysis > Automatic Categorization > List of Items is designed for brand lists and other lists.
  • Insert > Text Analysis > Automatic Categorization > Entity Extraction identifies people, places, and other entities automatically from data.
Choosing the Right Type of Manual Categorization for Text Analysis

Mutually Exclusive Categories means each of the text responses will be assigned to a single category. It's appropriate if the data is relatively simple and each response really only captures one idea. For example:

  • You asked respondents in a survey to give you the name of a brand (perhaps their favorite brand or what brand first comes to mind)
  • You want to allocate a response into only one category (perhaps in terms of sentiment, such as positive/neutral/negative or to the main idea of the response)

Mutually Exclusive Categories will result in one new categorical variable.

Multiple Overlapping Categories is used when each piece of text represents more than one idea, preference, or opinion. Typically it is used when the text information is longer than a few words or a single sentence. Examples include:

  • You asked in a survey for a completely open-ended response, such as describing a brand, the reasons for giving a rating on a scale, and so forth
  • Respondents give you multiple brand names in the one variable
  • Spontaneous awareness results are recorded across multiple variables - one for each brand
  • You are analyzing a blog post or a social media post, and there are lots of possible themes within the text that you want to pull out and analyze by

Multiple Overlapping Categories results in multiple new variables combined as a variable set (a Binary - Multi).

  • It is binary because each can either have a Yes/No (1 or 0) if the response is associated with the category or not.
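If it helps to picture the data structure, the sketch below shows what such a binary multi variable set looks like as a small table. The data and column names are invented examples.

```python
# A binary (multi) variable set: one 0/1 column per category, and a response
# can belong to several categories at once. Invented illustration only.
import pandas as pd

categorized = pd.DataFrame(
    {
        "Price":    [1, 0, 1],
        "Coverage": [1, 1, 0],
        "Nothing":  [0, 0, 0],
    },
    index=["cheap and good coverage", "great network", "low prices"],
)
print(categorized)
```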

For a worked example of how to categorize into Multiple Overlapping Categories, see this blog post.

Automatic List Categorization of Text Data with Displayr

It can often be difficult and time-consuming to organize raw text data into meaningful insights. Manually coding even a single text question can take several hours, even with relatively small sample sizes. Displayr has built-in text categorization tools designed to help you quickly categorize your text data and easily find valuable insights. One of these tools is the automatic categorization of text data into a list of items.

The example below uses text data from a survey about Tom Cruise.  An open-ended question asked respondents "What don't you like about Tom Cruise?", which yielded a wide range of responses.  Use the automatic categorization tools in Displayr to generate a list of items to see what main themes are of concern to the survey respondents.

Automatic Categorization of Text Data into a List of Items

Start by importing your data containing the text variable you want to categorize into Displayr (see here for more information on importing). With the data loaded into Displayr, you're now ready to run the categorization analysis:

  1. Go to Insert > Text Analysis (Analysis group) > Automatic Categorization > List of Items.
  2. In the object inspector (the section that opens on the right of the screen), under Inputs > Text variable(s) select the variable that contains the text you want to analyze.
  3. Change the Inputs > Minimum category size to set the smallest number of responses a category must contain. For this example, we'll set the minimum category size to 2. This means that Displayr will only show categories that contain at least 2 responses.

  4. The output will calculate automatically:

List of items output

Interpreting the Output

The left side of the output lists the automatically generated categories. The number of individual responses that are put into each category is displayed in the frequency column. The variants indicate how many different specific variations of responses there are in each category. The right side of the output shows the raw (original) text and the normalized or categorized text. The normalized text is the "cleaned" text. In other words, this is the text after spell checking, stemming, word removal and other cleaning functions are performed.

The colors of the boxes around the categorized words are randomly assigned to each category. The only purpose of this is to give specific words a unique style and make them easier to identify. There is no further meaning to it. If "Scientologist" is in a blue box, then all occurrences of "Scientologist" will be in a blue box. This makes it easier to recognize common categories.

How to Save the Categories to your Data Set

You can save the created categories to your data set as a new variable, which can then be used in other analyses. Make sure that the output above is selected on the Page and then go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories. A new multi-select variable is added to your Data Set tree below your original text variable. You can now use it like any other variable in your data to create tables, crosstabs, charts, and visualizations.

Automatically Extract Entities and Sentiment from Text
https://www.displayr.com/automatically-extract-entities-and-sentiment-from-text/ (1 November 2019)

Text data often refers to entities, such as people, organizations, or places. These entities can be automatically extracted from text data, and then used in further analyses. As an example, in this post, I reanalyze a famous set of tweets by a candidate for the US Presidency in 2016 and see how the sentiment relates to who has been named in the posts.

Extracting the entities

In Displayr, we extract the entities by clicking Insert > Text Analysis > Automatic Categorization > Entity Extraction and then selecting the text variable of interest. After a bit of a wait, you get the output below. You can expand out these groups to see what's been found.

Saving the entities as variables

The next step is to save the entities as variables in the data file. This is done by selecting the output, and then clicking Insert > Text Analysis > Advanced > Save Variables > Categories.

Comparing sentiment by entities

OK, so within the Person entity, we've worked out who is mentioned in the tweets. What next?

We can crosstab this with other information. In the example below, I've computed the sentiment of the tweets (Insert > Text Analysis > Sentiment) and crossed it with the items within the Person entity. The averages show the mean sentiment score of the tweets that mention each name. Scores below 0 indicate negative sentiment, and scores in red are statistically significantly low. You can probably work out who sent the tweets!
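The underlying calculation is simply an average of sentiment scores within each group. Here is a minimal sketch in R, using made-up scores and generic person names rather than the real tweet data:

```r
# Hypothetical sentiment scores and the person mentioned in each tweet.
sentiment <- c(2, -1, -3, 1, -2)
person    <- c("Person A", "Person A", "Person B", "Person C", "Person B")

# Mean sentiment score per person mentioned (the averages described above).
tapply(sentiment, person, mean)
# Person A Person B Person C
#      0.5     -2.5      1.0
```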

Automatic Categorization of Unstructured Text Data
https://www.displayr.com/automatic-categorization-of-unstructured-text-data/ (1 November 2019)

Categorizing text data can be a time-consuming and expensive activity. In cases where time is short and budgets low, using automatic categorization of text data can save the day and give you a good idea of what's contained in your data.

In the following example, I have some text data collected in a survey about Tom Cruise. The question was "What don't you like about Tom Cruise?" and the responses are pretty varied. Categorizing these responses would normally take a good couple of hours, if not longer. Here, instead, I'll run an automatic text categorization to see what main themes are of concern to the survey respondents.

How to run Automatic Text Categorization in Displayr

I've imported my data as usual (see here for more on that) and I'm ready to begin my analysis.

  1. Go to Insert > Text Analysis (Analysis) > Automatic Categorization > Unstructured Text.
  2. In the object inspector (the section that opens on the right of the screen), under Inputs > Text variable select the variable that holds the text you want to analyze.
  3. Change the Inputs > Number of categories to the number of categories you would like to classify the data into.  I've chosen 15 for this example.
  4. The output will calculate automatically, and looks like this:

On the left of this output you can see the automatically generated categories; the center column shows the proportion and count of cases in the file that have been allocated to each category; and on the right are examples of the types of responses that have been allocated to it. Clicking the ▶ button will show you all the text that has been assigned to that category.

How to Save the Categories to your Data Set

Saving the categories assigned to your data—so that you can use them in other analyses—is easily done. Make sure that the output above is selected on the Page and then go to Insert > Text Analysis > Advanced > Save Variable(s) > Categories. A new variable will be added to your Data Sets called "Categories from..." This new variable will store which category each case in your file has been categorized into. It allows you to then combine this categorized data with other variables in your data set.

To create a simple example of a table that uses categorized data and another variable, I start by dragging a variable from the Data Sets pane onto a document Page. Next, I select a second variable, in this case, the one containing the categorized data, and drag it onto the table I already created, taking care to drop it in the Columns field that appears when you hover over the table. The result is the table shown below, where I reduced the categories to five and crossed the automatically generated categories with the education level of the respondents in my data set.
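For readers who prefer to see the equivalent calculation in code, here is a minimal sketch in R of the same kind of crosstab, using made-up category and education values (in Displayr itself, this table is built for you by drag and drop):

```r
# Hypothetical saved categories and education levels, one value per respondent.
categories <- c("Scientology", "Arrogant", "Scientology", "Crazy", "Arrogant")
education  <- c("High school", "University", "University", "High school", "University")

# Counts of each category by education level.
table(categories, education)
#              education
# categories    High school University
#   Arrogant              0          2
#   Crazy                 1          0
#   Scientology           1          1
```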

Fully-integrated Text Analysis with Displayr
https://www.displayr.com/integrated-text-analysis/ (3 October 2019)

In a state of information overload, people are communicating digitally more than ever, whether it's a tweet, a review, or an open-ended response to a survey. The data within those messages is an untapped gold mine of thoughts and opinions that companies can turn into insights.

The good: you're able to get down to the nitty-gritty of how someone really feels about a product, brand, or topic, so you can improve their overall experience.

The bad: you're most likely outsourcing your text data to another company, or manually coding the lines of text yourself, so it's not exactly a walk in the park.

You've probably thought to yourself that there has to be a better way to do this. Cue Displayr: the all-in-one survey analysis and reporting tool with built-in Text Analysis.

Displayr's Text Analysis features allow users to:

  • Eliminate the need for multiple tools that are time-consuming and expensive.
  • Save time with one tool for all your survey analysis and reporting needs.
  • Automatically update reports with new data, so there is no need to start over from the beginning: simply put the data in to get the story out.

If you're interested in learning more about Text Analysis and how you can cut your analysis time in half, book a demo today!

Not ready for a demo? Check out our latest Text Analysis webinar and test it out for yourself with a free trial of Displayr.

How to Calculate Sentiment Scores for Open-Ended Responses in Displayr
https://www.displayr.com/how-to-calculate-sentiment-scores-for-open-ended-responses-in-displayr/ (5 August 2019)

Nothing is ever as accurate as having a researcher manually code text variables, one case at a time. But with a large survey sample (or with Big Data), the efficiency gained by using sentiment analysis can outweigh the loss of accuracy in finding the story in your data.

How does it work?

Displayr sends the text variable to an online English dictionary (using R) to score the words as positive, negative, or neutral. Positive words score +1, while negative words score -1. The final sentiment score is the sum of these scores. The process also attempts to identify when sentiment has been negated: for example, "not good" would generate a score of -1 instead of +1.

To illustrate, consider the following cases from a hypothetical text variable. The first case receives a sentiment score of +2, while the second case has a score of -2. The words contributing +/-1 towards the total score in each case are shown in brackets:

I really enjoyed (+1) the webinar – it was fun! (+1):  Score = +2

I didn’t like (-1) the webinar – because I hate (-1) the speaker: Score = -2

A sentiment score is generated for every respondent in the survey and saved as a numeric variable.
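For the curious, here is a minimal sketch in R of this kind of dictionary-based scoring with simple negation handling. The word lists and the score_sentiment() function are illustrative assumptions, not Displayr's actual implementation:

```r
# Tiny illustrative word lists; a real dictionary contains thousands of terms.
positive_words <- c("enjoyed", "fun", "good", "great", "like", "love")
negative_words <- c("hate", "bad", "boring", "awful")
negators       <- c("not", "didn't", "never", "no")

score_sentiment <- function(text) {
  # Lower-case and split into words, keeping apostrophes so "didn't" survives.
  words <- tolower(unlist(strsplit(gsub("[^[:alnum:][:space:]']", "", text), "\\s+")))
  score <- 0
  for (i in seq_along(words)) {
    w <- 0
    if (words[i] %in% positive_words) w <- 1
    if (words[i] %in% negative_words) w <- -1
    # Flip the sign if the previous word negates it, e.g. "not good" scores -1.
    if (w != 0 && i > 1 && words[i - 1] %in% negators) w <- -w
    score <- score + w
  }
  score
}

score_sentiment("I really enjoyed the webinar - it was fun!")             # +2
score_sentiment("I didn't like the webinar because I hate the speaker")   # -2
```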

How do you run it in Displayr?

Displayr makes it convenient to compute a sentiment score variable. Simply select the text variable in the Data Tree tab and then select Insert > Text Analysis > Sentiment from the Ribbon.

The result is a new numeric variable in your Data tree available for analysis. You can use this variable in a variety of ways:

  • In cross-tabulations with other questions to see how the sentiment score may vary for different groups within the sample.
  • Looking at correlations of sentiment scores with other numeric variables (e.g., use Correlation Matrix).
  • You could also turn the numeric sentiment score variable into a categorical variable to divide your sample into those who are positive, neutral, and negative on the topic (see the sketch after this list).
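For that last point, here is a minimal sketch in R of one way to bucket the numeric scores into a categorical variable; the cut points used here are an assumption, not a Displayr default:

```r
# Hypothetical sentiment scores for five respondents.
scores <- c(-2, 0, 1, 3, -1)

# Bucket the numeric scores into Negative / Neutral / Positive.
cut(scores,
    breaks = c(-Inf, -0.5, 0.5, Inf),
    labels = c("Negative", "Neutral", "Positive"))
# [1] Negative Neutral  Positive Positive Negative
# Levels: Negative Neutral Positive
```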

In some cases, you may like to “clean” your raw text variable before computing the sentiment scores. This is where the Text Analysis Setup feature can help (click here for more detail). In Displayr it is found under Insert > Text Analysis > Advanced > Setup Text Analysis. This creates an R output on the page where the raw text is processed for spell-checking, stemming, removal of words, replacement of specific words, and combination of words into phrases. To calculate the sentiment scores from the Text Analysis Setup, simply select the Text Analysis Setup on the page, and then Insert > Text Analysis > Sentiment from the Ribbon.

Try it yourself

In Displayr, you can try it yourself with the sample document here. Use the last variable in the data set (open-ended attitudes towards Microsoft).

You can also use sentiment analysis with social media data. This was the subject of this case study, which analyzed Trump's Tweets during the 2016 election. For a demonstration of how sentiment analysis works on raw social media data, check out the Trump Tweet Case Study. For a demo of our sentiment analysis tool and other advanced text analysis tools, you can view our Text Analysis webinar.
