Advanced data checking, cleaning, and tidying

Don't get tripped up by dodgy data!

Watch this webinar and learn the advanced data cleaning techniques that make your data more accurate and your workflow faster.

The Displayr document featured in this webinar can be found here.

Part 1: The beginner’s guide to data cleaning and preparation

This webinar will focus on

  • Automating data cleaning and checking
  • Creating new variables
  • Visualizations for understanding missing data patterns
  • Automatically combining categories and variables
  • Advanced techniques, including model-based outlier detection, model-based imputation, and multiple imputations


This builds on from our earlier beginner's guide to checking, cleaning, and tidying data.

As mentioned in the beginner's webinar, we've got two broad goals: better insights, faster.

And, we do this by checking, cleaning, and tidying data.


In our beginner's guide webinar, I focused on the workflow of checking your data file and then going through every variable, and checking, cleaning, and tidying it.

We will now go into more depth on these topics. And, we will look at the new topic of how to identify low quality cases

And how to speed things up via automation



This page lists the specific topics we will work our way through.

We will kick off with a quick way of checking a data file. Both Q and Displayr have a tool that automatically checks a data file's quality.

We will check this little data file

Pages Tree > Below Agenda > + > Report > Check for errors

So, as you can see, there's no ID variable, and some of the variables seem to be set as categorical, when they should be numeric.

So, we've got this variable called FMX, it's got no label, so we don't know what it is, and it seems to be numeric, but has been stored as categorical.


Categorical variables

We will start with categorical variables.

Everything in grey was covered in the beginner's guide webinar.

We went through the tables looking for problems. If you are in a huge rush, this is also something you can automatically do. I will choose the file we started cleaning last week.

Click on Phone.sav

+ > Report > Tables for Data Checking

Displayr's created different tables showing the various possible problems.

As the folder name implies, this is a table where the sample size varies within the table. As discussed in the beginner's webinar, this is often a sign of data problems.

Here's some histograms for different numeric variables that are thought may contain outliers. I will shortly return to how these work

This folder contains highlights data that needs to be checked because there are small counts.

And, as you can see, it's highlighted some garbage data, which we need to clean up.


Converting categorical data to top 2 boxes

This is probably the most common way of tidying data in market research. We've got multiple ratings. And, we want to create new variables that show the top 2 box score, which, in this case, is the percentage of people that either Like or Love the brands.

Click on Like and Love, right-click > Create NET

But, we don't want it like this. Instead, we want a completely new set of variables, which only show the top 2 box score.

So, we first need to find the variable se used to create the table.

+ Create > Ready Made New Variables > Top 2 Category....

As you can see, this new table just contains the Like + Love data.

The reason we want a new set of variables, is that we want to be able to crosstab it neatly.


Automatically combine...

Here's a common type of ugly data. Zip codes or postcodes. We want to aggregate them into a more meaningful way. This is easy.

Click on > Rat Sightings > Incident Zip

+ Ready-Made Formulas > Automatically Combine Categories > By Geography > Insited states > To Counties

A new variable has been created that contains the combined categories.

So, now we can see the counties.


Converting income to numeric

Here's a table with just too many categories. We could merge the small ones. But, often useful convert it to being numeric, so we can see the average.

Structure: Numeric

Mmmm.. This says the average income is 21.2. that's obviously wrong!

Why does it show that?

In the raw data, we can see that a score of 1000 to 2999 has a value of 2. If we want to have a more sensible numeic variable, the secret is to replace the values with the midpoints.

For example, we could set this value to 2000

2 -> 2000

But, that'd take a while as we have so many categories.

Fortunately, both Displayr and Q can do this automatically.

+ Ready Made New Variables > Numeric Variables from Code/Category Midpoints


As you can see, Displayr's looked at the labels, and made sensible guesses about midpoints.


Creating NPS as a numeric variable

We can calculate the NPS of his data pretty easily.

Insert calculation

Create NPS

However, the smart way is to create a numeric variable:

Convert to NPS


Numeric variables

In our earlier webinar, we looked at the minimum and maximum values and used it to make some cleaning and tidying decisions.


Automatically identifying...

Another approach is to use rules that automatically identify outliers
The most common ways of working out outliers is like this.

You calculate the Average. Here, the average weight of a sample of people is 266 pounds.

You calculate the standard deviation, which is 262.

Then, an outlier is defined as any data that is more than 3 standard deviations from the mean. In this example, that means an outliers is somebody who has a weight of less than negative 520 pounds and more than 1052 points.

This is the rule that Displayr was automatically applying in the Tables for Data

Personally, I don't like this type of approach. In this example, it's just obviously wrong. People cannot have a negative weight. And, they're really unlikely to be real people if their weight is more than 1052 pounds!

Lots of similar rules have been invented. For example, we can change it to 2.5


This does improve things, but it's ovbiously ridiculously subjective, and there are other problems when we would need a different rule.


Examining histograms/density plots

My preferred approach is to create a histogram or density plot.

What we are really looking for is one of the following two problematic shapes.

Multiple points.

Or, some outliers.


Numeric case study - weight in points

So, we will turn our table into a histogram.

Visualization switcher > Categorizable histogram.

It's pretty easy to spot the problem now. We've got a cluster of results near 999.

Let's look at the unique values

It's pretty common that when interviews are conduted by phone or face to face, values of 9 hundred and ninety are used to indicate different types of missing values, such as Don't know, and Refused.

So, we will set them to missing values.

Set 990+ Exclude from analyses

That looks a lot better.

One little thing I like to do is to increase the number of bins


Drag to about 436

The pattern we are seeing is known as shelving. Notice that we have high values n every multiple of 10, and almost as high in fives, which tells us that people are using their memory and judgment, rather than reporting current measurements.



So, we've now cleaned our weight data, but how do we tidy it to make it easy to work with? One approach is to convert it into categories. This is known as banding.

For example, a popular way is as percentiles.

Click on Weight in Ponds > Ready Made New Variables >

Automatically Combine Variables > By Value > Percentules

This has done it as deciles.

Let's change it to quartiles.


Text variables

In the beginners guide, we talked about how sometimes text vairables shouldn't be text variables.

Another problem that can be detected in text data is duplicates. If two people have said the same long thing, it's a sign there's a problem.

Another issue is poor quality text.

We asked people what they disliked about their phone company.

Person number 5 typed garbage.

If the text contains garbage, it's a good chance that the rest of their data is garbage, so should just be deleted.

Percentages: 4

So, this is the best guess at quartiles.

Or, we could do equally spaced categories.

Method qually spaced categories

Number of categories: 4

But, the cutoffs aren't very neat, so we can change them to round numbers.

Method: Tidy categories.


Back coding

The first thing I will show is back coding. It's pretty simple, but I'm always surprised by how many people aren't familiar with it.

In this example, people had an Other option and 1 person selected it.

Back coding is the process of combing the categories with the raw responses.

We click on the variable containing the other specify responses

Click on Colas > Living arrangements - Other > + > Text Categorization> Manual > Mutually Exclusive > New

We then hook it up to the closed ended data


Inputs and Back Coding

Corresponding… :Living arrangements

Right-click on New Category > Delete

OK, so we have things single mum. She shouldn't be in other. She should be in Living with children only.

Click Living with children only.

Save categories.

A new variable has been created. We will use it to replace the other variable.

Now, we have 0% in Other, which is correct.

Coding categorization

Market researchers refer to the process of categorising text as coding.

Here we've got some data on reasons for disliking phone companies.

Data Sets > Mobiles feedback ... > Dislikes… > Text Categorization > Manual Categorization > Mutual > New

When coding text data we have to decide if each respondent is permitted in one and only one category. That is, single response coding, where the categories need to be mutually exclusive.

Or, whether we permit multiple response coding with overlapping categories

For this example, I will code people into only one category

Mutually exclusive > New

272 people said Nothing, so I will create a category for them.

Click on New Category

Rename: Nothing

Let's create a category called service

Create category: Service

We've got a webinar which shows you how to efficiently do this so I won't do it now.

Save categories.

Once you have done your coding, you get a new variable.

Drag Dislikes - Categorial onto page.

And you then analyze the new variable just like any other quantitative variable



OK, Now we are will do a deep dive into examining some missing data.

Driver analysis with missing values

Here's a driver analysis. Look at the footer. We've got 482 - 372= 110 missing values, which have been ignored by the regression.

Is this a problem?


Little's MCAR test

Perhaps the greatest statistician of missing data, Roderick Little, invented a statistical tests to check if we will get the wrong conclusions by ignoring missing values.

Search: Little

As the p-value is less than 0.05, it suggests that the regression where we've just ignored the missing values is a problem.

What can we do?

Let's start by taking a closer look at the missing values.


Visualize missing values by case

This visualization can sometimes reveal problems. It's a heatmap where we have one row for each respondent, and one column for each variable.

Anything > Data > Missing Data > Plot Missing Data by Case

When this plot is most useful, is when it shows big blocks of missing values. Bu, that hasn't happened here.

Visualize missing values by pattern

Anything > Data > Missing Data > Plot by Pattern

Uncheck variable names

This one's a bit harder to interpret.

This column of bars shows us the frequency of patterns, with the biggest pattern at the bottom.

The most common pattern is that of having no missing data, which is good.

The next most common pattern is that 33 people are missing Q17_5. In a real world study we would want to work out if that means anything interesting.



Our situation is this. We've got missing values. We know we can't ignore them, due to Little's MCAR Test.

We haven' found any pattern to suggest there is some big problem.

What should we do?

One option is to impute them, which is statisican-speak for guessing what they should be.

Here's a very small data file.

Select the first 3 variables in Imputation Example

View in Data Editor

We've got no age data for the first six people.

Imputation refers to replacing the missing values with our best guess as to the correct values.

If you had to impute the age for respondents 1 and 2, what would your guess be?

One way of solving this is to look at the other people who are the same. That is, all the other respondents that are Not married and have no children

Select rows 8, 19 and 20

So, our best guess may be that more will be 18 to 44 than 45 or more.

Let's get Displayr to compute.

We start by selecting the data we wish to impute, Age.

+ > Read-Made Formulas > Age

Mmm.. It doesn't seem to have done a good job. That is, we should be seeing more people aged 18 to 44, but we aren't.

When we click on the newly created variable, we see the problem.

We have attempted to impute age. But, we worked out that most people should be 18 to 44 by looking at Married and Has Children. So, in order for our imputation to work well, we need to allow it to also use this information.

Auxiliary variables

This creates a new variable using a very cool predictive methodology called Multivariate Imputation via Chained Equations, or Mice for short.

And, it's now doing a better job.

If you think about it for a while, you'll realize that there's something a bit fishy here.

We is respondent 1 45 or more, rather than, say, respondent 2?

Good imputatoin algorithms make this type of decision using randomization, as they want to ensure that the imputed variables contains sufficient variation.

We can undersand how random variation affects things by playing arond with this seed option, which mixes up how the randomizatoin is done.


Multiple imputation

Returning to our regression, we've still got our missing value problem.

We could using imputation to fix our input variables.

But, this is fully automated within regression.

Click on the regression

Missing data = Multiple imputation

What multiple imputation does is that it first imputes the variables, then runs the regression, then re-imputes the data, and runs the regression again. It does this 10 times and then takes the average result from the 10 regressions.

Please check our our webinar on driver analysis to learn more about this.


Creating new variables

Along the way we've already looked at lots of different ways of creating new, clean and tidy variables.

But, I will share some cool more advanced approaches.

Mathematics on sets of variables.

Phone > Expand hours:

In this data file, I've got 18 variables measuring the number of hours that people spent doing different activities.

And another 18 about minutes.

Expand minutes

I need to create new combined variables.

Let's look at the minutes.

Step 1 is that I need to convert the minutes to hours, by dividing by 60.

For example, the 11.9 in the first row of the table should become 0.2

Select minutes

Calculation > Divide

Enter 60

So, we've got some new variables.

Drag across new minutes and replace current table

Cool. Now, we need to add this to the data on hours.

I'm going to do this in a really, really, complicated way, just to showcase a super advanced feature.

I will click on he first variable

Expand R CODE

Over here on the right, it shows me the code it's used to do the division. I'm just going to modify this, and add it to the hours data. I will write a plus.

Go to end of last line.


And, click on my hours.

Click on HOURS.

And now you can see the results have updated, and with just a little bit of code we've done the math on 18 variables.

And, we need to be tidy and update the label.

Label: Hours spent on activities

And we'll hide the other variables so they aren't inadvertently used in any analysis.

Select Hours and Minutes



Identifying low quality responses

We've already reviewed a fe approaches to identify low quality responses.

Another approach is to use lie tests. For example, one way that psychologists work our who fills in forms dishonestly is by asking if people never lie. Anybody who says they never lie is flagged as a liar.

A more traditional market research approach is to see if they claim to buy a brand that doesn't' exist.



Speeders are people that coplete a survey in an impluasibley short time.

As with other numeric variables, we look at duration using a hisogram.

This histogram suggests we've got no problem with speeding in this survey.


Model-based outliers

Returning to our earlier regression model, with most predict models, there are ways of identifying respondents that do not fit well with the rest of the data. Such data have various names, including outliers, unusual observations, and influential observations.

You can see a warning here. When we click on it, we are being warned that we have such observations.

So, we need to find out who they are.

There are lots of different definitions of outliers, so I will just choose one to show you the basic idea.

Press - Plot Cook's Distance.

So, this tells us that we want to examine observations 74, 288, and 418.


Identifying low quality respondents in MaxDiff and Conjoint

In maxdiff and conjoint, one way of identifying low quality respondents is to see how well the model predicts their choices.

My preferred way of doing this is using something called the Root Likelihood.

RLH Root (likelihood)

This adds a new variable.

And, we'll create a histogram of it.

Switcher > Categorized histogram

In this study, there were five alternatives, so we would expect an RLH of at least 0.2. However, if you look at the histogram, it suggests a cutoff of 0.4.

A note of caution with measures like this. There are two reasons why you get poor predictive accuracy for some respondents

One is that they've given bad data

Another is that the model itself is not right

Here we have some technical warnings, so we really need to first check these.


Flatlining / Straightlining

When people choose all the same options in a set of related questions, they are said to have flatlined or straightlined. For example, respondent 37 chose the highest option on every single question.

Pages + > Report > Straight lining

This report lists the various question and the proportion of respondents in them who have flatlined or straightlined.


So, we are almost at the end!

We've already looked at lots of ways of automating, checking, cleaning, and tidying.


Combining variables

If you have a huge study, principal component analysis is a great way of replacing lots of variables with a small number.


Drag across predictors for customer feedback.
For example, this analysis says we could do a reasonably job if we replaced the 9 different variables measuring performance here, with just two factors.

Check our our webinar on factor analysis for more about this.


Tests for tracker

On the left here is a table showing gender by week.

The corrected p is showing if there are any statistically significant differences. A value of 0.05 could indicate that my sample is imbalanced by gender.

Now, here's a simple test for a tracker.

Here's a bit of code which just checks to see if any of he p-values are less than 0.05.

What's the point of this?

Well, I can update my tracker, and I will get an error if anything changes.

Select customer feedback

Update to 12 weeks

When I update you can see that there are significant thigns on the tale

And, we've got a big error on the page, so we can't miss the problem.



So, this is the overview I presented at the beginning of the webinar.

Read more

Cookies help us provide, protect and improve our products and services. By using our website, you agree to our use of cookies (privacy policy).

The complete solution for cleaning a survey

See how!