How to analyze survey data

Fast-track your survey analysis skills and learn all the basics of finding and sharing your data story.

This 30-minute webinar, run by Tim Bock, is for people new to survey analysis.

In this webinar you will learn:

  • The best data files
  • Cleaning and formatting
  • Creating filters and weights
  • Analysis plans
  • Stat testing
  • Finding the story
  • Reporting the results


I'm going to walk you through the basics of how to analyze a survey.

We've put this together for survey analysis beginners. If you've never worked through all the steps in analyzing a survey on your own, this webinar is for you.

Case study

And, we will do it using a case study, where we explore how likely people were to buy this product concept.

Please take a moment to read it.



I will take you through eight stages of analyzing a survey.


Getting the right type of data file

The first stage in analyzing a survey is getting data in the right format.

This is the first big mistake that people make when they analyze surveys.


The wrong type of data

They click the export button in their data collection software and get an Excel or CSV file. And they try to analyze this data.

But, Excel and CSV files weren't invented for survey analysis. Yes, you can use them, but you will double the time it takes to do analysis and you will likely make lots of mistakes.


Good data file formats

There are three great data file formats specifically designed for surveys:

  1. The most well known was invented many years ago by a division of IBM called SPSS. This file format is the industry standard and is used by just about all survey analysis software.
    These files are called SPSS data files, or .sav files.
  2. Two even better file formats are MDD and Triple S. They're a bit harder to get.
  3. And, if you are using Displayr or Q, you can do even better and import your data straight from SurveyMonkey, Decipher, and Qualtrics into Displayr.

But, let's say you just can't get one of these better data files, and need to use Excel or CSV files.

You can still do good analysis, but it will be a bit slower, as you will need to spend more time tidying it up.


What marginally OK data looks like

The good file formats all ensure the data is neat at the beginning.

If you are using bad file formats, you often need to set it up manually so that it's in good shape.

You need:

  1. Row 1 to show what the data means
  2. A unique ID for each respondent
  3. Each respondent's data in one and only one row
  4. Data stored numerically, where you can
  5. Multi-pick questions set up with 1s and 0s, as shown here
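For readers who prefer to see this as code, here's a rough pandas sketch of what well-formatted raw data looks like. The respondents, column names, and code values are all made up for illustration.

```python
import pandas as pd

# A sketch of well-formatted raw data for three hypothetical respondents:
# row 1 (the header) says what each column means, there is a unique ID,
# each respondent occupies exactly one row, single-pick answers are stored
# numerically, and the multi-pick question is split into 1/0 columns.
raw = pd.DataFrame(
    {
        "resp_id": [1001, 1002, 1003],   # unique respondent ID
        "gender": [1, 2, 2],             # stored numerically (e.g. 1 = Male, 2 = Female)
        "likes_security": [1, 0, 1],     # multi-pick option: 1 = chosen, 0 = not
        "likes_design": [0, 1, 1],
    }
)
```

A file shaped like this imports cleanly into just about any survey analysis tool.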


What terrible data looks like

Terrible data is data that looks different from what I just described. If you try to import data like what's on the screen now, you will end up with lots of problems.

What impossible-to-analyze data looks like

And, if your data is already structured as multiple tables, you're basically stuffed, and you will just get error messages if you try and import tables like this.

You need what's called the raw data. That is, data with one row for each respondent, as I just showed you on the earlier slides.

I will import an SPSS .sav file now.

Add data set > My Computer > Resources - Documents\Data\Concept Test\iLock.sav

The data file's variable sets are shown here on the bottom left.

Each of these shows either a question in the survey, or, some other type of information that has been stored in the data file.


Cleaning and tidying by variable set

Now that we have data, our next step is to clean and tidy it.


How to clean and tidy a data set

There's a basic process for cleaning and tidying.

We create a summary table for each variable set.

We check each summary table looking for dirty data. What this means is shown below. There's no need to read this now; we are going to do it for the data set shortly.

And, if the summary table is dirty, we then need to do some data cleaning.

Let's start by creating the summary tables.

Insert > Report > Summary Tables


Response ID

In a survey, each person is typically given a unique code. This is what's shown here. This is what it should look like. All is in order.


Duration (in seconds)

This is how long the questionnaire took to complete, on average, in seconds. It's a bit hard to read in seconds.

We can tidy this up by converting it to instead show the data in minutes.

So, we will divide this by 60.

Insert variable > Ready-Made Formulas > Divide
Enter 60

Let's have a look at the data.

That's an average of a bit over 10 minutes.

We want to look at the minimum.

Statistics > Cells : Min

So, the fastest person did the survey in a little under 4 minutes. That's plausible. If the numbers were implausible we would need to delete the data with the implausible values.
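The duration check above can be sketched in plain Python. The completion times below are hypothetical, and the two-minute cut-off is just an example threshold, not a rule from the webinar.

```python
# A sketch of the duration check, using hypothetical completion times in seconds.
durations_sec = [610, 225, 755, 980, 540]

durations_min = [s / 60 for s in durations_sec]        # convert to minutes
average = sum(durations_min) / len(durations_min)      # a bit over 10 minutes
fastest = min(durations_min)                           # just under 4 minutes

# Implausibly fast respondents (say, under 2 minutes) would be flagged for deletion.
too_fast = [m for m in durations_min if m < 2]
```

Here the fastest time is plausible and nothing gets flagged, so no rows would need deleting.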


User language

We've got data on user language. It's showing us the raw data rather than a summary table.

This is because whoever created the data file set it up to show this data as if it were text rather than categories.

We can change this to instead show a table of percentages.

Structure > Nominal

So, 100% of people doing the survey speak English.



Note that Displayr's showing percentages and the counts, which are the number of people who chose each option.

Both are useful, but most of the time it is the percentages that are most useful. A mistake that novices make is to report the counts instead. This isn't so interesting, as who cares if 134 people are Male?

When we look at percentages, it's more interesting. This says that 55% of adults in America are Female. If correct, that's a useful thing to know. This is the goal of surveys. To estimate things about the world outside of the survey itself.

The correct value in the population for this data is 51% for females, but it's not too badly skewed, so it's not a problem.



The survey was only asked of adults. So, this first category isn't interesting. Tidying in this case means removing it.

We've no huge difference between the ages, so that's fine.



It's usually better to look at this data as a map.

Insert > Visualization > Geographic Map

We can see that the biggest states are California, Texas, Florida, and New York, so this looks basically right.


Population density

The bottom category's pretty small.

Only 9 people.

That's too small for useful analysis.

We need to merge the bottom two categories.



We will merge these too



We will merge the smaller categories



And, these other labels are too long and will make our report messy.



Let me first rename the variable.

GENERAL > Label <> Income

We've got a lot of income categories.

One option is to merge them. But, a better option is to treat the data as being numeric.

It's showing me an average income of 17.7. That doesn't make sense!

We need to look at the data values to understand this better.

Press DATA VALUES > Values

Ah, the way the data has been set up, an income of less than 1000 is a 1, 1000 to 2999 is 2, and so on.

What we can do is replace these values with midpoints. For example

1 -> 500
2 -> 2000

This is called midpoint recoding. We can do this automatically.

TRANSFORMATIONS > Numeric Variables from Midpoint Categorization

Appearance > $

That makes more sense.
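Midpoint recoding is easy to sketch in code. The first two code-to-midpoint pairs come from the text; the higher brackets and the respondent codes below are hypothetical.

```python
# Midpoint recoding sketch: replace each income category code with the midpoint
# of its dollar range. Codes 1 and 2 mirror the example in the text; codes 3
# and 4 are made-up higher brackets.
midpoints = {1: 500, 2: 2000, 3: 4000, 4: 7500}
income_codes = [1, 2, 2, 4, 3]     # raw categorical codes for five respondents

income_dollars = [midpoints[c] for c in income_codes]
mean_income = sum(income_dollars) / len(income_dollars)   # now a sensible dollar figure
```

The average is now in dollars rather than a meaningless category number like 17.7.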

What, if anything … like

You will recall we showed a description of an iLock

We asked them to say, in their own words, what they liked.

When we have text data, we need to categorize this into groups, so that we can then summarize the data like any other data. In survey research, this is often called coding.

Insert > Text Analysis > Manual > Multiple overlapping .. > New

OK, so the first response is garbage.

I will create a category to store poor quality data, as we will want to delete these respondents later.

We've got 109 people that said Nothing

The basic idea is that you read through the responses and categorize them by judgment.

Now, I won't bore you by making you watch. I did it earlier, and I'll load it now. Please check out our webinar on text analysis if you want to know more about how to do this.

Import > Resources - Documents\Data\Concept Test > iLock Likes.Qcodes

Now, this causes new variables to be added to the data set.

I'll give you a moment to read what people liked about the iLock.

Now, if you look at the third row from the bottom of the table, you can see that 5% provided poor quality data.

To get a better idea of what that means, I'm going to filter the raw text so it only shows the poor quality data.

Inputs > FILTERS & WEIGHT > New

I've chosen the data we just created.

So, we asked them what they liked about the product. And, these 5% of people basically told us junk.

If you think about how surveys work, all the previous questions just asked people to choose options. We have no way of checking if they chose sensibly. This is the first opportunity to see if the people are doing a good job answering, and these 16 people haven't.

So, the right thing to do is to delete all their data. If they have given us garbage here, we can't rely on anything they've said.

Note that we currently have 16 of 300 rows selected.

I am going to delete all the responses that contain the bad data.

As you can see, our sample size has now reduced to 284.

Look at the table. The Poor quality data category now shows 0%.

What, if anything, do you … Dislike

Here we have asked about dislikes.

As with the other text data, we need to code it. But I won't bore you by doing it while you wait.

Insert > Text Analysis > Manual > Multiple overlapping .. > New

As with before, I've already done it.

Import - iLock.Dislikes > Save categories

This time it shows 0% with poor quality data. But, in surveys, you need to be a bit careful when you see 0%, as there can still be people in the category.

Statistics > Cells > Count

Ok, so one person with poor quality data.

Let's have a look at his data.


We'll need to give this a unique label.

So much for them all being English speaking!

I will delete him as well.



Which phrase

This is usually called Purchase intent.

I'll change the name of the page and the underlying data to match this.

Variable > Purchase intent

This data is now clean and tidy.


Compared with similar

This is often called uniqueness.

Variable name: Uniqueness

We will return to this table later.

How well…

This is often called Brand fit.

Title: Brand fit

There are too many categories. We should merge some of them.


How likely …

This is priced purchase intent. Nothing to do here other than change the label of the data.


Browser meta info - Browser

This tells us what type of browser they were using.

Let's look at this as percentages.

Data Sets > Browser Meta … > Structure > Nominal


Techniques for cleaning

We've just gone through the process of cleaning and tidying. This page summarizes what we just did.



Next comes weighting. It's also known as sample balancing, calibration, raking, and post-stratification.

The basic idea here is that, in a survey, you will often end up under-representing some groups in the population.

I've looked at the census data for gender and age, so I will just look at those.

When I compare this to the Census, I see we have a few too many females. We should have 51%, not 56%.

Too many 18 to 24s, and too few people aged 55 to 64.

But, none of the differences are huge, so we don't need to weight it.

If you do want to know how to weight, we've got both an ebook and a webinar next week on it.
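The core idea of post-stratification weighting can be sketched in a few lines. This uses the gender figures from the text (56% female in the sample, 51% in the census); real weighting schemes like raking handle several variables at once.

```python
# Sketch of simple post-stratification weights: each respondent's weight is
# their group's target (census) share divided by its sample share.
sample_share = {"Female": 0.56, "Male": 0.44}
target_share = {"Female": 0.51, "Male": 0.49}

weights = {g: target_share[g] / sample_share[g] for g in sample_share}
# Females are weighted down a little (about 0.91), males up (about 1.11).
```

Since none of the differences here are large, the weights are close to 1, which is why the webinar skips weighting for this study.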



Filtering is the process of running analyses on only a subset of the data. You will remember we earlier filtered our text data to look at the low quality responses.

We will do more filtering later.
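In code, a filter is just a boolean mask over the raw data. The responses and column names below are made up to illustrate the idea.

```python
import pandas as pd

# Filtering sketch with made-up responses: keep only the middle, non-definitive
# purchase-intent groups, then summarize a 1/0 dislike column for that subset.
df = pd.DataFrame(
    {
        "intent": ["Definitely", "Probably", "Might", "Probably not", "Definitely not"],
        "dislikes_security": [0, 1, 1, 0, 1],
    }
)

middle = ["Probably", "Might", "Probably not"]
subset = df[df["intent"].isin(middle)]                  # the filter is a boolean mask
pct_security = subset["dislikes_security"].mean() * 100
```

Every analysis run on `subset` ignores the respondents outside the filter, which is exactly what the filter controls in Displayr do.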


Overview - Planned analyses

This next topic is the thing that really separates out expert survey researchers from the rest.

Well before you look at your data, you need to very carefully identify the key things you need to work out.

What novices do instead is write a questionnaire but never take the time to work through how they are going to analyze it, and this causes trouble when it comes time to do the analyses.

The specific plan that you will have depends entirely on what you are interested in. There's no standard plan.


Analysis plan

Here's a simple analysis plan for this survey. I will work through it.

The first thing is, is the concept viable?

12% have said they would definitely buy it.

People tend to exaggerate how likely they are to buy things, so you typically need to compare this data to benchmarks. The benchmark I'm using for this survey is 25%. So, we are a long way behind benchmark.

OK. What's next? We need to compare our purchase intention priced versus unpriced.

Go back to The iLock is a Loser

Let's look at the other bits of data that we planned to look at.

But, before I do this, note that we've got 44% of people saying they would definitely not buy.

If we are going to find opportunities to improve the product to make it more appealing, we need to focus on these three middle categories: the people who aren't definitive one way or the other.

Here's our table from before of Dislikes.

We're going to filter it now, and just look at the data of people who said they Probably would buy, Might or might not buy, or Probably wouldn't buy.

Data > Purchase intent

What we want to see is one big category of dislikes, as then we know what we need to focus on. The two biggest dislikes are the Apple brand and concerns about security. But, they're both pretty niche.

Let's look at uniqueness.



Most people are viewing it as somewhat different. So, the problem isn't that it's perceived as a "me too" product.

Brand fit

Only a few people are thinking it fits poorly with Apple. So that's not the problem.



The most used tool in survey analysis is the crosstab.

Let's say we wanted to understand if purchase intent differs by gender.

We can do a filter.

Let's filter for men.

Now, let's filter for women.

Looking at this we can see that the females are a bit more likely to say Definitely will buy, with a score of 13% versus 10%.

Now, when you analyze surveys, you always want to do analyses like these, so we need a faster way than filtering.

We could filter the data first for males, then for females, and compare.

But there is a faster way.

This is called a Crosstab. Each column has a separate filter, in this case based on the categories in Gender.

You can see it says Column % in the table. This is to remind us that the filters are in the columns.
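The column-percentage crosstab has a direct equivalent in pandas. The responses below are made up; the point is that `normalize="columns"` applies a separate filter per column, just like the crosstab on screen.

```python
import pandas as pd

# Crosstab sketch with made-up responses: purchase intent by gender as
# column percentages, so each gender column sums to 100.
df = pd.DataFrame(
    {
        "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
        "intent": ["Definitely", "Might", "Might", "Definitely", "Definitely", "Might"],
    }
)

column_pct = pd.crosstab(df["intent"], df["gender"], normalize="columns") * 100
# Each column now holds the % of that gender choosing each intent category.
```

Reading down a column answers "of the men (or women), what share chose each option?", which is the Column % convention the table uses.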

The next question we have to ask is this: how meaningful is the difference between the 10% purchase intention for men versus the 13% for women?

Is the difference reliable? Or, is it just a fluke?

Fortunately this is a topic that the whole discipline of statistics has focused on solving.

The arrows tell us whether the differences between the filter groups are reliable enough to tell other people about. There are no arrows in the first row, so we can't conclude a difference between the I would definitely buy it scores of the men versus the women.

Yes, we do get what's called a significant difference in the third row, but as that row's not very interesting, this significant difference is immaterial.
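A common test behind arrows like these is the two-proportion z-test, sketched below in plain Python. The 10% and 13% come from the text; the subsample sizes of roughly 140 each are an assumption, as the true sizes sit in the data file.

```python
import math

# Sketch of a two-proportion z-test, one of the standard tests behind
# significance arrows in crosstabs.
def two_prop_z(p1, n1, p2, n2):
    """Return the z statistic and two-sided p-value for comparing p1 with p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF, via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_prop_z(0.13, 140, 0.10, 140)   # women 13% vs men 10%, assumed n=140 each
# p comes out well above 0.05, matching the "no arrow" (not significant) result.
```

With samples this size, a three-point gap simply isn't enough evidence to rule out a fluke.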

When you do surveys, you tend to have to do lots of analyses like these. So, we can automate the process further. I'm going to automatically create crosstabs comparing purchase intent by all the demographics.

Insert > More > Tables > Lots of crosstabs
Rows: Purchase intent (priced)
Columns: Gender … Income RECODED

Here's the difference by gender we saw before.

That's more interesting. Purchase intention is strongly related to age. Note here that I'm looking at the first number in each cell, which is the Column %.

25% for the 18 to 24s. All the way down to 0% for the 65 or older.

We've got a significant difference in Alabama. But, there are only 5 people from Alabama in the study, so I'm going to ignore it.

There's no difference by population density.

In the all-important I would definitely buy it, there's no difference by education.

There is higher purchase interest among Black respondents.

Possibly also among Asians, but as we only have 15 of them in the sample we need to be quite cautious. Even with the Black group, 37 is pretty small as sample sizes go.

There are no arrows, so no difference by income.

Stat testing / statistical significance

We've just done stat testing.

Now, we move onto finding the story.


Finding the story

The pope of the day asked Michelangelo how he'd carved this most famous of all statues.

He said, "It's simple. I just remove everything that's not David."

This is also the key principle of doing useful analysis and reporting.

We just go and delete everything that's not interesting.

This duration data's not interesting.

Let's get rid of this (user language)

Gender: Remember, we created the sample profile slide before.

Somebody may ask about this, so we need to keep it.

This isn't interesting.

Analysis plan: We need to keep this.

iLock is a loser: We need to keep this.

Crosstabs down to "It's simple…"

We want to structure the information so that the key bit's at the very beginning.

Then, supporting information, and then, more detail.

Like a pyramid.

And, gloss it up a bit.

This is a mix of the key conclusion and some supporting material, so we need to pull it apart.

And, we need to make the age pattern we found before clearer.

Chart > Stacked column chart
Appearance > Highlight > No

If we were brave and strong, we would be like Michelangelo and delete everything else. It's just rubble. But, if a little less brave, we can go with the pyramid.

Insert > New Page > Title Page

So, we're done. We can either create a dashboard. Or, export it to PowerPoint.
