Learn More about Dimension Reduction in Displayr

Correspondence Analysis


Webinar:
DIY Market Mapping Using Correspondence Analysis

Ebook: DIY Correspondence Analysis

How Correspondence Analysis Works (A Simple Explanation)

Understanding the Math of Correspondence Analysis

How to Interpret Correspondence Analysis Plots

Correspondence Analysis Versus Multiple Correspondence Analysis

Principal Component Analysis

Principal Component Analysis (Wiki example)

How to Do Principal Components Analysis in Displayr

The Basic Mechanics of Principal Components Analysis

Principal Component Analysis of Text Data

Varimax Rotation

Component Score Coefficient Matrix

Kaiser Rule

Determining the Number of Components in Principal Components Analysis

Validating Principal Components Analysis

Common Misinterpretations of Principal Components Analysis

Text Analysis - Advanced - Principal Components Analysis (Text)

Saved Principal Components Analysis Variables

 

Multidimensional Scaling and t-SNE

What is Multidimensional Scaling (MDS)?

t-SNE

How t-SNE Works

Goodness of Fit in MDS and t-SNE with Shepard Diagrams

 

Rotate Your Correspondence Analysis to Better Understand Your Brand Positioning

Correspondence analysis is perhaps the most widely used multivariate tool in market research. It's our "go to" tool for displaying complex tables, such as brand association tables. A practical challenge with correspondence analysis is that it is designed to best show all of the relationships in the data, but sometimes we are more interested in one particular brand. That is, we want to focus our attention on finding insights that relate to our (or our client's) brand. This can be achieved by rotating the correspondence analysis.

Case study: carbonated soft drinks

The visualization below shows the map created by a correspondence analysis of a segment of the Australian carbonated soft drink market. In the top-right we have the highly caffeinated energy drinks, all clustered together and owning energy-related attributes. Fanta appears in the top left, being for Kids and Fun, while Coke, Pepsi and Lift sit near the middle. When a brand sits near the middle of a map it means that the map isn't doing a great job at describing what makes it unique.

As far as maps go, this one is pretty good. It explains 77.5% + 17% ≈ 95% of the variance that can be shown by a two-dimensional correspondence analysis. Usually in data analysis 95% is pretty good. But, we're interested in finding out what has been lost. Are there any interesting insights hiding in the missing 5%?

 

When doing a correspondence analysis, it is possible to compute the quality of the map for each of the individual points. In this case study, we're interested in brand. If we compute the quality (which is a standard output in nearly all correspondence analysis software), it shows that, in increasing order, the quality is 68% for Lift, 69% for Pepsi, 78% for Diet Coke, 87% for Coke, and 99%+ for the other brands. Note that the overall map quality of 95% is not the average of the individual quality scores. Correspondence analysis focuses on the brands with the strongest relationships, and these stronger relationships make up the lion's share of the 95% value.
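The quality score itself is worth understanding: it is the share of a point's squared principal coordinates that falls in the plotted dimensions. Below is a minimal R sketch of this idea, using made-up numbers (quality is a standard output of correspondence analysis software, so you would not normally compute it yourself):

quality = function(coords, dims = 1:2)
    sum(coords[dims]^2) / sum(coords^2)

# Hypothetical principal coordinates of one brand across four dimensions
quality(c(0.42, 0.39, 0.15, 0.05))  # share of the point's variation shown in 2D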

Imagine that you were the brand manager for Lift. Looking at the map, we can see it seems to be a bit like a cola. It is as different from Coke as Diet Coke is. And, it seems to be associated with the attributes Refresh and Relax, although it is a bit hard to be sure given the quality scores. (If you are new to correspondence analysis, please read How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for more detail on how to correctly interpret such a map.)

Rotating the correspondence analysis

Just like with factor analysis and principal components analysis, it is possible to rotate a correspondence analysis to make patterns in the data easier to see. In the map below, I've rotated the correspondence analysis so that it gives a 100% quality score for Lift. (The math is a bit complex, but you can find the equations here.)

The resulting map (displayed below) now shows 83% of the variance, so it is not as accurate overall as the earlier map. This is not a surprise. The standard correspondence analysis without rotation maximizes the variance explained overall. In making the map focus on Lift, we have implicitly accepted that we are willing to sacrifice overall accuracy in return for getting a better read on Lift's position.

Note that the broad features of this map are the same. The highly caffeinated brands are still grouped together, as are the colas, and Fanta is off on its own. However, the conclusions have changed for Lift. We can now see that Lift is much more on its own than implied by the previous map. And, we can see that it is relatively strongly associated with Refreshes, and only modestly with Relax.

Doing it yourself

Rotating correspondence analysis to focus on a particular brand is a new technique. Our paper describing how it works has only just been published. However, if you want to do it yourself, there are some easier solutions than reading the paper. If you know how to use R, we've open-sourced all the calculations here. You can also do it for free in Displayr. And, you can do it in Q as well. Please reach out to me if you need any help.

Focusing the Results of Correspondence Analysis in Displayr

Correspondence analysis outputs consist of coordinates (usually plotted on a scatterplot) that explain the most variation across all of the brands. When we are interested in a specific brand, it can be useful to use focused rotation, described below. This is a novel technique that we have developed, described in the paper A brand’s eye view of correspondence analysis published in the International Journal of Market Research.

Start your engines

The data we are using describes the characteristics that people associate with cars. The rows of the input table below are labeled with 14 car brands, and the columns with characteristics. Each cell indicates the strength of the association between a characteristic and a car.


The chart below shows the correspondence analysis resulting from this data. In Displayr it is created from Insert > Dimension Reduction > Correspondence Analysis of a Table. The data is plotted with principal coordinates normalization. This means that we can compare distances between column labels and distances between row labels, but not the distance between a row label and a column label. See this post for a more in-depth discussion about normalization and interpretation of correspondence analysis.



The dimensions output by correspondence analysis are in decreasing order of variance explained. This means that later dimensions explain smaller portions of the variance. The chart shows only the first two dimensions, which, for this example, capture only 53.4% of the variance. So the hidden dimensions contain a reasonable amount of information. Importantly, from the plot alone we cannot tell how much information about any given point (brand) is retained.

Our first car

As an example, Mini Cooper is relatively close to the origin. This could be because it is poorly represented by the two plotted dimensions. Or it could genuinely be the case that Mini Cooper is close to the origin in all dimensions.

If we were performing this analysis to find the relationship of Mini Cooper to the other cars and characteristics, we could not draw any strong conclusions from this plot. The best we could say is that in the first two dimensions alone, there is little to discriminate it.

Quality of the map

We can create a table showing how much variance is represented in each dimension with Insert > Dimension Reduction > Diagnostic > Quality Table. The resulting table (below) shows the variance in the first two dimensions next to the row label of each car. Since Mini Cooper has only 16%, we can now say that the plot above hides much of the information for this brand.


Making a sharp turn

In order to find out more about the Mini, we rotate the results so that all of its variance is in the first dimension. This means that there is no longer any hidden information about this point. We shift the focus of the output onto Mini Cooper.

In Displayr this is done by entering Mini Cooper in the box labeled Focus row or column. The effect of the rotation is shown below.



In this case, correspondence analysis produces embeddings in 5-dimensional space. If you find this difficult to visualize, join the club. What matters here is that there is no longer any hidden information about Mini Cooper. We can now see that it is more related to Fiat 500 than the other cars. This makes intuitive sense, as they are both small cars. We have gained insight by focusing on what differentiates Mini Cooper from the other cars.

However, note that the chart as a whole explains 46.3% of the variance in contrast to 53.4% in the first chart.  The price we pay for the rotation is that the first two dimensions no longer contain as much variance as possible about all of the data. It is no longer the best general representation of all the points.

Buying a new car

As another example, let's rotate to focus on the VW Golf. Notice how the plot below is very similar to the original, except rotated on the page.


This rotation is easier to visualize. We have turned the page clockwise by about 135 degrees and the relationship between VW Golf and the other cars has been closely maintained. The total variance explained has dropped by only 0.1% from the original plot. All of this tells us that VW Golf was well represented originally. This confirms the 99% variance in the first two dimensions from the quality table above.



TRY IT OUT
The analysis in this post was performed in Displayr. Click here to open the Displayr document containing all the analysis in this post. You can amend it or try it for yourself.


The flipDimensionReduction package (available on GitHub), which uses the ca package, performed the correspondence analysis.

The car data is from a latent feature analysis performed in Meulders, M. (2013). An R Package for Probabilistic Latent Feature Analysis of Two-Way Two-Mode Frequencies. Journal of Statistical Software, 54(14), 1-29. This analysis uses data from Van Gysel, E. (2011). Perceptuele analyse van automodellen met probabilistische feature modellen [translation from Dutch: Perceptual analysis of car models with probabilistic feature models]. Master thesis. Hogeschool-Universiteit Brussel.

3D Correspondence Analysis Plots in Q

The data

In this post I use a table created from the following Pick Any - Grid.


Correspondence analysis

To create a correspondence analysis plot in Q, follow these steps:

  1. Create a table. With a grid like this, you do so by creating a SUMMARY table. However, you can also create a crosstab.
  2. Select Create > Dimension Reduction > Correspondence Analysis of a Table.
  3. Select the table to be analyzed in the Input table(s) field on the right of the screen.
  4. Check the Automatic option at the top-right of the screen.

You will end up with a visualization like the one here. Note that this plot explains 65% + 21% = 86% of the variance that can be explained by correspondence analysis. Fourteen percent is not shown. This fourteen percent may contain interesting insights, and one way to see if it does is to plot a three-dimensional labeled scatterplot.



Interactive 3D scatterplot

We now need to write a bit of code - but don't worry! We just need to cut and paste and change a few characters.

  1. Go to Create > R Output.
  2. Copy and paste in the code shown after point 4 on this page.
  3. Replace my.ca with the name of your correspondence analysis. If you right-click on the correspondence analysis in the report tree and select Reference name you will find the name (you can modify the name if you wish).
  4. Check the Automatic option at the top right of the screen.
 
library(plotly)

# Extract the first three dimensions of the row and column coordinates
rc = my.ca$row.coordinates
cc = my.ca$column.coordinates

# Plot the row labels (red) and the column labels (blue) as text in 3D
p = plot_ly()
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              type = "scatter3d", mode = "text", text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE)
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3],
              type = "scatter3d", mode = "text", text = rownames(cc),
              textfont = list(color = "blue"), showlegend = FALSE)

# Hide the mode bar, label each axis with its variance explained,
# and use the same scale for all three axes
p = config(p, displayModeBar = FALSE)
p = layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
                           yaxis = list(title = colnames(rc)[2]),
                           zaxis = list(title = colnames(rc)[3]),
                           aspectmode = "data"),
           margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding = 0
my.3d.plot = p

You will now have a 3D plot like the one below. You can click on it, drag things around, and zoom in and out with the scroll wheel on your mouse.



Sharing your 3D scatterplot

If you export this visualization to PowerPoint it will just become a static picture, and any changes you have made will be lost. The best way to share this visualization is to export it to Displayr. Sign up is free, and allows you to create and export dashboards to web pages, which can then be shared. Click here to go into a Displayr document which contains the visualizations in this post - click the Export tab in the ribbon to share the dashboard.

See these examples in more detail here, or to learn more, check out our free Correspondence Analysis eBook! 


How to do Traditional Correspondence Analysis in Displayr

There are a few variations on the technique of correspondence analysis (including correspondence analysis of square tables, multiple correspondence analysis, and correspondence analysis of multiple tables), but in this post I focus on the most common technique, which could be called traditional correspondence analysis. This is a technique originally derived to understand the patterns in contingency tables, but it can readily be applied to other kinds of data as well. In this post, I show you how to set up your correspondence analysis in Displayr.

Like all data analysis, there are a range of issues to keep in mind as you conduct your analysis and interpret the results. For a deep dive on the topic area, check out our eBook and other posts on the topic.

Step 1 - Create your table

The starting point of your analysis is the table of data that you want to analyze. While the original application of correspondence analysis was for contingency tables of counts, the technique works effectively for a range of other kinds of data so long as the data is all on the same scale. This includes crosstabs showing counts, percentages, or averages, grids of data created from binary variables, and even raw numeric data.

There are three main ways that you can add a table to Displayr:

  1. Paste in your data.
  2. Use Displayr's built-in statistical engine to compute the table from raw data.
  3. Use R to compute a table.

I will briefly explain how to do these below.


Option A - Paste in data

The simplest path to a correspondence analysis is when you already have the table you want to analyze. In this case you can just paste it right in. To do so:

  1. Select Home > Paste Table.
  2. Click  Type or paste data  in the Object Inspector on the right side of the screen.
  3. Paste in your table of numbers into the spreadsheet-style interface (like below) and click OK.

Your table will appear on your page. If you don't need to use the table for another analysis or visualization, you can also paste this data in to the correspondence analysis directly.

Option B - Use the built-in statistical engine

Displayr has a powerful engine for computing tables (and charts) from raw data. Before you can use it, you must have a data set added to your document. To add your data, select Home > New Data Set.

Once you've got a data set loaded, use the following steps to create your table:

  1. Click Home > Table.
  2. Choose the data that you want to show in the table using the Rows and Columns menus (sometimes called Show and By depending on the type of data you have selected) in the Inputs > DATA section in the Object Inspector on the right side of the screen.
  3. Choose which statistic you want to analyze using Inputs > STATISTICS > Cells. You should select one statistic for your table.

For example, here I have created a table based on some data from a survey about technology brands. The table shows the devices people own across different income brackets.

Option C - Calculating tables with R

You can run your own custom R code in Displayr using Insert > R Output. The code that you run will depend greatly on what data you have at hand, and what kind of table you want to create. Examples include:

  1. Using a function like table() to calculate a contingency table.
  2. Using data.frame() to construct a data frame object containing raw data.
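
For example, here is a minimal sketch of both approaches, using hypothetical variable names:

# Hypothetical raw data for illustration
education = c("Primary", "Secondary", "Tertiary", "Secondary", "Primary")
readership = c("Glance", "Very thorough", "Fairly thorough", "Glance", "Glance")

# Option 1: a contingency table of counts
my.table = table(education, readership)

# Option 2: a data frame containing the raw data
my.data = data.frame(education, readership)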

Whichever method you use, the mechanism for connecting the data to the analysis, described in the next section, is the same.


Step 2 - Run your analysis

Now that you have your data in a table, you can add a correspondence analysis output to your document:

  1. Select Insert > Dimension Reduction > Correspondence Analysis of a Table.
  2. Click into the Input table(s) box in the Object Inspector on the right, and select the table you have created above.
  3. Remove any additional rows which correspond to 'nets' or 'totals' by adding the corresponding row/column labels in the Rows to ignore and Columns to ignore sections on the right. These should typically not be included in the analysis, and Displayr automatically removes the default ones.
  4. Customize your title, colors, fonts, and grid lines using the settings on the right.

The map will appear as a scatterplot on your page.

For more on how to interpret a chart for a correspondence analysis, see How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think).

Moon plot

A nice alternative to the standard scatterplot output of correspondence analysis is the moonplot. To display a moonplot:

  1. Go to the Object Inspector on the right and change the Normalization setting to Row Principal.
  2. Change the Output option to Moonplot.

The moonplot for my brand image grid looks like this.

The moonplot shows the rows of the table (the brands in this case) in the center, and the column of the table (in this case the attributes) around the edge of the circle. For reasons explained in Moonplots: A Better Visualization for Brand Maps, the moonplot can be easier to interpret than the standard chart.

Click here for an interactive tutorial on Correspondence Analysis

Ready to make your own correspondence analysis? Click the button above, or sign up to Displayr for free here!

3D Correspondence Analysis Plots in Displayr

Traditional correspondence analysis

Traditional correspondence analysis plots typically show the first two dimensions of a correspondence analysis. Sometimes, additional insight can be gained by plotting the first three dimensions. Displayr makes it easy to create three-dimensional correspondence analysis plots.

The data

In this post I use a brand association grid which shows perceptions of cola brands.


Creating the correspondence analysis

The first step is to create a correspondence analysis. In Displayr, this is done as follows:

  1. Create a table of the data to be analyzed (e.g., import a data set and then press Insert > Table (Analysis)).
  2. Select Insert > Dimension Reduction > Correspondence Analysis of a Table.
  3. Select the table to be analyzed in the Input table(s) field in the Object Inspector.
  4. Check Automatic (at the top of the Object Inspector).

This should give you a visualization like the one shown below. You can see that in this example the plot shows 86% of the variance from the correspondence analysis. This leads to the question: is the 14% that is not explained interesting?



Creating the interactive three-dimensional visualization

  1. Insert > R Output
  2. Paste in the code below
  3. Replace my.ca with the name of your correspondence analysis. By default it is called correspondence.analysis, but it can have numbers affixed to the end if you have created several correspondence analysis plots. You can find the correct name by clicking on the map and looking for the name in the Object Inspector (Properties > GENERAL).
 
library(plotly)

# Extract the first three dimensions of the row and column coordinates
rc = my.ca$row.coordinates
cc = my.ca$column.coordinates

# Plot the row labels (red) and the column labels (blue) as text in 3D
p = plot_ly()
p = add_trace(p, x = rc[,1], y = rc[,2], z = rc[,3],
              type = "scatter3d", mode = "text", text = rownames(rc),
              textfont = list(color = "red"), showlegend = FALSE)
p = add_trace(p, x = cc[,1], y = cc[,2], z = cc[,3],
              type = "scatter3d", mode = "text", text = rownames(cc),
              textfont = list(color = "blue"), showlegend = FALSE)

# Hide the mode bar, label each axis with its variance explained,
# and use the same scale for all three axes
p = config(p, displayModeBar = FALSE)
p = layout(p, scene = list(xaxis = list(title = colnames(rc)[1]),
                           yaxis = list(title = colnames(rc)[2]),
                           zaxis = list(title = colnames(rc)[3]),
                           aspectmode = "data"),
           margin = list(l = 0, r = 0, b = 0, t = 0))
p$sizingPolicy$browser$padding = 0
my.3d.plot = p

You will now have an interactive visualization like the one below. You can click on it and drag with your mouse to rotate, and use the scroll wheel in your mouse (if you have one) to zoom in and zoom out.


Click the button below to see the original dashboard and modify it however you want!

Explore 3D Correspondence Analysis

Sharing the interactive visualization

You can also share the interactive visualization with others, by using one of the following approaches:

  • Press Export > Web Page and share the URL of the web page with colleagues. This includes an option to require password access. For more on this, see our Wiki.
  • Press Export > Embed, which will give you some code that you can embed in blog posts and other websites, which will make the interactive visualization appear in them.

If you click here you will go into Displayr and into a document containing the code used to create the analyses and visualizations in this post, which you can then modify to re-use for your own analyses.

Adding Supplementary Points to a Correspondence Analysis

Retrospectively adding supplementary points to a correspondence analysis can greatly assist in the interpretation of results. Supplementary row or column points are included after the core data has determined the map, providing extra context without changing the map itself.

Correspondence analysis is a technique for analyzing tables of data, often used to produce a 2-dimensional scatterplot (map) for brand positioning purposes. The map shows the relative position of brands and various attributes.

This post describes how to add supplementary points to a correspondence analysis, and how to interpret them on a map. It uses time-series and brand subset examples. There is a link to the worked example from this article at the end of this post.


How supplementary points can improve interpretation of results

Supplementary points can aid in the interpretation of correspondence analysis by providing additional context to the main data. That context may be changes over time (e.g., tracking data) or a subset of the data treated as ancillary points. Supplementary points are additional rows and columns that do not determine the axes, but you can plot them on the same map.
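
Mechanically, a supplementary row is simply projected onto the axes that the core data produced: its row profile is multiplied by the standard column coordinates from the core analysis. Here is a minimal sketch of this projection, with made-up numbers standing in for the core analysis outputs:

# col.std stands in for the standard column coordinates of the core analysis
col.std = matrix(c(-1.0, 0.4, 0.6, 0.8, -0.9, 0.1), ncol = 2)  # 3 columns x 2 dimensions
sup.row = c(12, 30, 8)                           # counts for the supplementary row
sup.profile = sup.row / sum(sup.row)             # convert the counts to a row profile
sup.coords = as.vector(sup.profile %*% col.std)  # coordinates on the existing map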


Trends of brand perceptions

An earlier post describes how you can use correspondence analysis to analyze trends. I have repeated one of the scatterplots from this earlier post below. It shows the change in people's perceptions of technology brands from 2012 to 2017.

The plot has Principal coordinates normalization. This means that the distances between row points and the distances between column points are meaningful, but not necessarily the distance between row and column points. Click here for a further explanation about interpretation and normalization.



Separating core and supplementary points

An alternative way to analyze the same data is to consider 2017 as the "ground truth" and plot 2012 as supplementary points. This means that the 2017 data determine the dimensions and axes of the map. The 2012 data are added afterwards as supplementary points. In technical correspondence analysis terminology, the 2012 rows have zero mass.

We can see below that while the output shows the same themes as the first chart, it is different in the detail. Easy to use and Innovative are now closer together. We can now deduce that on the basis of 2017 data, Easy to use and Innovative have more similar meanings.



Yet another perspective is to consider 2012 the ground truth and to then plot the 2017 points as supplementary. This produces the results below where Easy to use and Innovative are further apart than in the original chart. Evidently, the association between innovation and ease of use is a more recent phenomenon.


All three charts are equally valid views of the data. They differ in their emphasis. For instance, the second chart would be most relevant for a study on the state of the technology market in 2017. In this case, the 2012 data is added for context but does not influence the positioning of the 2017 points.

Note that the first chart from the previous post is an "average" (in a strictly non-technical, hand-waving sense!) of the 2012 and 2017 charts.


Focusing on a subset of data

The second example below is the correspondence analysis resulting from a table of 14 car models. Let's say we wanted to study the 4 German brands. They form a line across the top from Volkswagen on the left, through Audi and Mercedes, then BMW. The chart has Row principal normalization. This means that it is valid to compare distances between row points. It is also valid to measure the association between rows and columns by their scalar products.

We might be tempted to say that the Volkswagen is Popular, the Audi and Mercedes are Luxury, and the BMW X5 is Sporty. Before doing so, note that the total explained variance is only 53%. This means there is information hidden in the dimensions that are not plotted.



Let's repeat the analysis, this time treating all the non-German cars as supplementary. Now we see that the Audi A4 is very near the center of the plot. This means that it is not strongly associated with any of the characteristics. We can conclude that amongst all 14 cars the Audi is considered a luxury car, but amongst the German cars, it is not. Note also that the total explained variance below is now almost 97%. This means that we can be more confident about our conclusions.

There is also a close relationship between Family and Sporty. Evidently, the German cars discriminate relatively little between those characteristics.



Finally, we can check the result above by removing the supplementary points. This produces the chart below, which is the same except we can no longer see how the German cars relate to the non-Germans.



Conclusion

You can add data to a "core" correspondence analysis as supplementary points. The advantage of supplementary points over just analyzing all the data together is that supplementary points do not influence the placement of core data points. As the name implies, they are added after the core data has determined the map. Supplementary data points are an excellent way to provide additional context to an analysis that is driven entirely by another part of the data set.


TRY IT OUT
All the analysis in this post was conducted in Displayr. Review the worked example from this post or run your own analysis by clicking through to this correspondence analysis example. The supplementary points are specified in the Inputs panel, seen on the right after clicking on any map. You can also try your own correspondence analysis for free in Displayr.


The flipDimensionReduction package (available on GitHub) was used, which itself uses the ca package for correspondence analysis.


The car data is from a latent feature analysis performed in Meulders, M. (2013). An R Package for Probabilistic Latent Feature Analysis of Two-Way Two-Mode Frequencies. Journal of Statistical Software, 54(14), 1-29. This analysis uses data from Van Gysel, E. (2011). Perceptuele analyse van automodellen met probabilistische feature modellen [translation from Dutch: Perceptual analysis of car models with probabilistic feature models]. Master thesis. Hogeschool-Universiteit Brussel.

Moonplots: A Better Visualization for Brand Maps

A correspondence analysis is the standard tool for creating brand maps. It shows which brands compete with which other brands and the basis for that competition.

A standard brand map is easily misread

The example of a correspondence analysis plot below is pretty standard. It shows data using row principal normalization, which is the best normalization for brand mapping data. To an expert in correspondence analysis, this map is easy to read. To a novice, it also appears easy to read. Unfortunately, the novice generally misreads such a map, as it encourages the less-expert viewer to draw incorrect conclusions.

A novice will look at this map and draw conclusions based on the distance between points. This is how a scatterplot is almost always read, as such an interpretation is an obvious one (and the plot below is a scatterplot). As a result, this interpretation will lead to conclusions such as that Diet Coke is associated with Beautiful, and Pepsi with Urban. Unfortunately, these conclusions are wrong.



The correct interpretation

The correct interpretation of the map above is that Diet Coke is strongly associated with Innocent, Sleepy, Feminine, Weight-conscious, and Health-conscious. In fact, the strength of association between an attribute and a brand is not determined by their distance on a map. It is instead computed using the following steps (please read How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for a more detailed explanation):

  1. Drawing a line from the brand of interest to the origin.
  2. Drawing a line from the attribute of interest to the origin.
  3. Calculating the angle between the two lines.
  4. Computing the cosine of that angle.
  5. Measuring the length of the first line.
  6. Measuring the length of the second line.
  7. Multiplying together the cosine of the angle and the two lengths.

This is, by any yardstick, a complicated set of instructions for reading a visualization. Consequently, it is hard to believe that even people that understand the correct interpretation will take the time to diligently apply it.
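
In vector terms, the seven steps collapse to a dot product of the two points' coordinates. A minimal R sketch with illustrative numbers:

# Coordinates of a brand and an attribute read off a row principal map
brand = c(0.8, -0.3)
attribute = c(0.5, 0.4)

# Steps 1-7 are equivalent to the dot product of the coordinate vectors
sum(brand * attribute)

# The same value via the lengths and the cosine of the angle between the lines
len.b = sqrt(sum(brand^2))
len.a = sqrt(sum(attribute^2))
cos.angle = sum(brand * attribute) / (len.b * len.a)
len.b * len.a * cos.angle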

The difficulties of interpretation have a few possible solutions. One is training. Sure, this is a good idea, but the point of this visualization is that it taps into our intuitive visual interpretation skills. So if training is required, the purpose of the visualization is undermined. Another solution is to draw lines from the origin of the map to the brands (or the attributes). Yet this still requires training (how else will people know the meaning of the lines?), so it is not a sufficient solution.


The solution is a moonplot

Illustrated below is an example of a moonplot. The key difference between the moonplot and the brand map relates to the display of attributes. The scatterplot above plots the attributes in the same space as the brands, while the moonplot plots all attributes equidistant from the center of the visualization. The font sizes on the map below convey the same information that the earlier brand map conveyed by the distance of the attributes from the origin.

 

Brand Map Moonplot


Advantages of the moonplot over traditional brand maps

This moonplot visualization has some big advantages over the traditional brand map display:

  1. First, it is tidier.
  2. Next, the tidiness makes it easier to understand the extent to which brands' positions are strong. Coke Zero and, to a lesser extent, Pepsi Max are closer to the center of the map than Diet Pepsi and Diet Coke. This means they are less differentiated than the other brands based on the attributes in the study. While an expert can obtain the same conclusion from the traditional map, with the moonplot it is obvious to everyone (novice to expert).
  3. The varying font sizes make it clear that all attributes are not equal. For example, the small font for Beautiful makes it clear that in some sense the attribute is unimportant. To deduce this from the traditional map requires expertise.
  4. Most importantly, the obvious interpretation of this map is correct in terms of the brand associations. For example, it is clear on this map that Diet Pepsi is associated with Feminine, Innocent, Sleepy, Weight-conscious, and Health-conscious. The user can work this out by glancing at the map, with no need for rulers, protractors, or an understanding of the dot product.

To create a moonplot using your own data

  1. Click here to create your own moonplot, by signing into the Displayr document used to create the visualizations in this post.
  2. Click on the moonplot (on the third page), and change the Data source (far right), to Type or paste data. (Or, import a new data set, create a new table, and select the new table as the data.)
  3. Press Edit data, and paste in your data and press OK.

 

Normalization and Scaling in Correspondence Analysis

Most correspondence analysis plots are misleading in at least three different ways, but the choice of normalization can increase this to five, so you want to get the choice of normalization right. This post provides an overview of the main normalization options, explains how to interpret the resulting maps, provides a technical explanation of the normalizations, and gives recommendations for the best approach to normalization for different situations.

If you need to create your own correspondence analysis, you can do so using the template below.

Correspondence analysis is a useful technique for compressing the information from a large table into a relatively easy-to-read scatterplot. The resulting plot, as is the case with most simplifications, is often misleading. When the plot is made, the analyst chooses the normalization, or leaves it at a default setting. This setting governs how the resulting map should be interpreted.


Overview of normalization options in correspondence analysis

The table below lists the main normalizations and the key concepts and terminology used. Please take note of one really important issue: there is no commonly agreed-upon meaning of the word "symmetric(al)". Different apps and authors use it to mean completely different things. For example, the most widely used program, SPSS, uses a meaning that is completely different from that of the most widely read author on the topic, Michael Greenacre. For this reason, I do not use this term.

| Normalization | Other names | Definition of row coordinates | Definition of column coordinates | How to interpret relationships between row coordinates | How to interpret relationships between column coordinates | How to interpret relationships between row and column categories |
|---|---|---|---|---|---|---|
| Standard | Symmetrical | Standard | Standard | The vertical distances are exaggerated | The vertical distances are exaggerated | No straightforward interpretation |
| Row principal | Row, Row asymmetric, Asymmetric map of the rows, Row-metric-preserving | Principal | Standard | Proximity | The vertical distances are exaggerated | Dot product |
| Row principal (scaled) | | Principal | Standard * first eigenvalue | Proximity | The vertical distances are exaggerated | Proportional dot product |
| Column principal (scaled) | Column, Column asymmetric, Asymmetric map of the columns, Column-metric-preserving | Standard * first eigenvalue | Principal | The vertical distances are exaggerated | Proximity | Proportional dot product |
| Column principal | | Standard | Principal | The vertical distances are exaggerated | Proximity | Dot product |
| Principal | Symmetric map, French scaling, Benzécri scaling, Canonical, Configuration Plot | Principal | Principal | Proximity | Proximity | No straightforward interpretation |
| Symmetrical (1/2) | Symmetrical, Symmetric, Canonical scaling | Standard * sqrt(singular values) | Standard * sqrt(singular values) | The vertical distances are somewhat exaggerated | The vertical distances are somewhat exaggerated | Dot product |

Interpreting plots created with the different normalizations

The first requirement for correct interpretation of correspondence analysis is a scatterplot with an aspect ratio of 1, which is the technical way of saying that the physical distance between values on the x-axis and the physical distance between values on the y-axis need to be the same. If you look at the plot below, you will see that the distance between 0 and 1 on the x-axis is the same as that on the y-axis, so this basic hurdle has been passed. But if you are viewing correspondence analysis in general-purpose charting tools, such as Excel or ggplot, be careful, as they will not, by default, respect the aspect ratio, which will make the plots misleading.

You can sign in to Displayr and explore this normalization example here.

 



Most standard correspondence analysis plots are misleading

As I mentioned in my introductory paragraph, most standard correspondence analysis plots are misleading in at least three ways.

The first way is that they only show relativities. For example, the plot above suggests that Pepsi and Coke (which were rows in the table) are both associated with Traditional or Older (columns). However, there is no way to conclude from this map which brand has the highest score on any attribute. In the case of maps using brand association data, it is quite common to have a leading brand with the highest score on all the attributes; the key when interpreting is to remember that the map only shows relativities.

The second general way that correspondence analysis maps mislead relates to the variance explained. If you add up the percentages in the x and y axis labels, you will see that they add up to 97.5%. So, 2.5% of the variance in the data is not explained. This is not much. But, the percentage can be much higher. The higher the percentage, the more misleading the plot. And, of course, it is possible that the two dimensions explain 100% of the variance, as is illustrated in Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R.

The map above is misleading in a third way. To the naked eye, it misrepresents the relationship between the columns. The plot shows that Weight-conscious is roughly the same distance from Older as it is from Rebellious. This is a misrepresentation of the data. To correctly interpret the relationships between the column coordinates, we need to remember that the vertical dimension explains only about a third of the variance, so vertical distances for the column coordinates on this plot are exaggerated. If you look at the plot below, it shows the relationship between the columns properly.

Row principal normalization and principal normalization

What is the difference between the two plots? The top one uses row principal normalization. This means it gets the rows right, but not the columns. The plot below uses principal normalization, which means it gets the rows and columns correct.

At this stage, it no doubt seems the principal normalization is better. Who would want a map which misrepresented the relationship between the column categories? Unfortunately, the principal normalization comes with its own great limitation.

The principal normalization is great at showing the relationships within the row coordinates, and also within the column coordinates. However, it misrepresents the relationships between the row and the column categories.  In the row principal normalization shown above, we can infer the relationship between row and column categories by looking at how far they are from the origin, and also the angle formed by the lines that connect them to the origin (if you are not familiar with how to interpret the relationship between the row and column categories, please see Understanding the Math of Correspondence Analysis: A Tutorial Using Examples in R for a technical discussion and How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) for examples and a more intuitive explanation).

 


The misrepresentation of the relationships between the row and column categories can best be described as being moderate. Yes, it is not possible to correctly work out all the relationships from the map, even if the map explains 100% of the variance. However, any strong relationships that appear on the map are likely to be correct. This makes the principal normalization a good default normalization. However, in situations where there is a clear focus on the rows, such as when using it to show brand positioning, as in these examples, the row principal normalization is generally superior.

It is also possible to use column principal normalization. If I have done a good job in explaining things, you can hopefully work out that this normalization correctly shows the relationships between the rows and the columns, but misrepresents the relationships among the row categories.

Symmetric (1/2) normalization

The next useful normalization is one that is referred to in Displayr and Q as symmetric (1/2) normalization. This normalization, shown below and defined in a bit more detail in the next section, correctly shows the relationship between the row and column coordinates. But, it misrepresents the relationships among the row points, and also among the column points. So, of all the normalizations we have seen so far, it is the one that misrepresents the data in the most ways. However, it does have an advantage: its degree of misrepresentation is the smallest. That is, while the row normalization misrepresents the column coordinates by quite a large amount, the symmetric (1/2) misrepresents them by a smaller amount. Similarly, while the column normalization misrepresents the row coordinates by a large amount, the plot below does so by a smaller amount.

The consequence of this is that in a situation where the main interest is in the relationships between the row and column coordinates, and there is no clear way of choosing between row and column principal normalization, this approach is the best one.



My favorite normalization

In my own work, I favor a variant of row principal normalization. In most of my work, I set up the tables so that the rows represent brands, as in this post. It is obvious to my clients that the brands are the focus, so they never get confused about the column coordinates issue, as they are not so interested in the relationships among the column categories. However, I have recently started using an improved variant of row principal normalization. Below I have repeated the row principal plot from the beginning of the post. A practical problem with this normalization is that the row categories tend to cluster in the middle of the map and the column categories at the periphery. Sometimes this can make it impossible to read the row categories, as they are all overlapping.


A straightforward improvement on the row principal normalization is to scale the column coordinates on the same scale as the x-axis of the row coordinates. This results in what Q and Displayr refer to as row principal (scaled) normalization. As I discuss in the next section, this is an improvement without cost.



A technical explanation of the different normalizations

Below are the core numerical outputs of a correspondence analysis of the data used in this post. The first row shows the singular values. The remaining rows show the standard coordinates for the rows (brands) and columns (attributes). Refer to Understanding the Math of Correspondence Analysis, for a detailed explanation about what these are and how they are computed.


In the row principal normalization, you multiply the standard coordinates of each of the row categories from the original table (i.e., Coke through Pepsi Max) by the corresponding singular values. The first two dimensions are then plotted. For example, for Coke Zero, its coordinate on the x-axis is .669*-0.63 = -.42, and its position on the y-axis is .391*.99 = .39. As mentioned, if the two dimensions explain all the variance in the data, then the position of Coke Zero relative to all the other brands on the map is correct.

Expressing these calculations as formulas, we have:

x for a row = Singular value 1 * Standard Coordinate 1

and

y for a row = Singular value 2 * Standard Coordinate 2

For the column categories, we just plot the standard coordinates:

x for a column = Standard Coordinate 1

y for a column = Standard Coordinate 2

This simpler formula is not correct. By ignoring the singular values, these coordinates misrepresent the scale. However, the reason for this "mistake" is that the dot product of these coordinates is meaningful. As described in Understanding the Math of Correspondence Analysis, correspondence analysis allows us to understand the relationships between row and column categories, where this relationship is formally quantified as the indexed residual, where:

Indexed residual for x and y = x for row * x for column + y for row * y for column

If you substitute in the earlier formulas this gives us:

Indexed residual for x and y = Singular value 1 * Row Standard Coordinate 1 * Column Standard Coordinate 1 + Singular value 2 * Row Standard Coordinate 2 * Column Standard Coordinate 2

When we use the principal normalization, this means we use the principal coordinates for both the row and column categories, which changes the formula to Singular value 1 ^ 2 * Row Standard Coordinate 1 * Column Standard Coordinate 1 + Singular value 2 ^ 2 * Row Standard Coordinate 2 * Column Standard Coordinate 2. As you can see, this puts the singular values in twice, and so no longer correctly computes the indexed residuals.

The symmetric (1/2) normalization computes the coordinates for both row and column categories using Sqrt(Singular value) * Standard Coordinate. As the principal coordinates, which multiply by the singular values rather than their square roots, are the correct ones, it follows that this normalization is correct neither for within-row comparisons nor for within-column comparisons. Nevertheless, its degree of error is lower than that of the standard coordinates. The indexed residuals are correctly computed because Sqrt(Singular value) * Sqrt(Singular value) = Singular value.

The row principal (scaled) normalization uses the principal coordinates for the row categories and, for the column categories, uses:

x for a column = Singular value 1 * Standard Coordinate 1

y for a column = Singular value 1 * Standard Coordinate 2

That is, it uses the first singular value for each of the two coordinates. This has the effect of contracting the scatter of the column coordinates on the map, but makes no change to their relativities (i.e., they remain wrong, as they ignore the reality that the y dimension explains less variation). This normalization also changes the indexed residual, so that rather than the dot product being exactly equal to the indexed residual when the plot explains 100% of the variance, instead the dot product becomes proportional to the indexed residual. Changing from an equality to a proportionality has no practical implication of any kind, as relationships between the row and column categories are only ever interpreted from correspondence analysis as relativities. This is why the scaling of row principal is generally appropriate.

Column principal (scaled) is the same as row principal (scaled), except that the focus is switched from the rows to the columns.
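
Pulling these definitions together, here is a minimal R sketch of how the various coordinate sets can be derived from the standard coordinates and singular values (the numbers are the Coke Zero values from the output above; the object names are hypothetical):

sv = c(0.669, 0.391)    # the first two singular values
std = c(-0.63, 0.99)    # standard coordinates of one category in two dimensions

principal = sv * std             # principal coordinates: standard * singular values
symmetric.half = sqrt(sv) * std  # symmetric (1/2): standard * sqrt(singular values)
scaled = sv[1] * std             # the (scaled) variants multiply the non-focus
                                 # categories by the first singular value only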


Conclusion

For the reasons outlined in this post, my view is that either the row principal (scaled) normalization or the column principal (scaled) normalization is typically best, although principal is an appropriate default in situations where the viewer is not actively involved in working out and communicating the most appropriate normalization.


Explore the data

I created all of the examples in this post with R. You can view and play with the examples, including using your own data, by clicking on this link (examples of normalization) and signing into Displayr to see the document that I wrote when writing this post.


 

Understanding the Math of Correspondence Analysis

If you want to quickly make your own correspondence analysis, this is probably the wrong post for you - but you can easily do that using this template!

Correspondence Analysis in R: A case study

The data that I analyze shows the relationship between thoroughness of newspaper readership and education level. It is a contingency table, which is to say that each number in the table represents the number of people in each pair of categories. For example, the cell in the top-left corner tells us that 5 people with some primary education glanced at the newspaper. The table shows the data for a sample of 312 people (which is also the sum of the numbers displayed).

 


I show the code for generating this table below. I have named the resulting table N.

 
N = matrix(c(5, 18, 19, 12, 3, 7, 46, 29, 40, 7, 2, 20, 39, 49, 16),
           nrow = 5,
           dimnames = list(
               "Level of education" = c("Some primary", "Primary completed",
                                        "Some secondary", "Secondary completed",
                                        "Some tertiary"),
               "Category of readership" = c("Glance", "Fairly thorough",
                                            "Very thorough")))

Computing the observed proportions (P) in R

The first step in correspondence analysis is to sum up all the values in the table. I've called this total n.

 n = sum(N) 

Then, we compute the table of proportions, P. It is typical to use this same formula in other types of tables, even if the resulting numbers are not strictly-speaking proportions. Examples include correspondence analysis of tables of means or multiple response data.

P = N / n

This gives us the following table. To make it easy to read, I have done all the calculations in Displayr, which automatically formats R tables using HTML. If you do the calculations in normal R, you will instead get a text-based table like the one above. Sign in to Displayr and view the document that contains all the R calculations in this post.



Row and column masses

In the language of correspondence analysis, the sums of the rows and columns of the table of proportions are called masses. These are the inputs to lots of different calculations. The column masses in this example show that Glance, Fairly thorough, and Very thorough describe the reading habits of 18.3%, 41.3%, and 40.4% of the sample respectively. We can compute the column masses using the following R code:

 
column.masses = colSums(P)

The row masses are Some primary (4.5%), Primary completed (26.9%), Some secondary (27.9%), Secondary completed (32.4%), and Some tertiary (8.3%). These are computed using:

 
row.masses = rowSums(P)

Expected proportions (E)

Referring back to the original table of proportions, 1.6% of people glanced and had some primary education. Is this number big or small? We can compute the value that we would expect to see under the assumption that there is no relationship between education and readership. The proportion that glances at a newspaper is 18.3% and 4.5% have only Some primary education. Thus, if there is no relationship between education and readership, we would expect that 4.5% of 18.3% of people (i.e., 0.008 = 0.8%) have both glanced and have primary education. We can compute the expected proportions of all the cells in the table in the same way.


The following R code computes all the values in a single line of code, where %o% means that a table is created by multiplying each of the row totals (row masses) by each of the column totals.

E = row.masses %o% column.masses

Residuals (R)

We compute the residuals by subtracting the expected proportions from the observed proportions. Residuals in correspondence analysis have a different role to that which is typical in statistics. Typically in statistics, the residuals quantify the extent of error in a model. In correspondence analysis, by contrast, the whole focus is on examining the residuals.

The residuals quantify the difference between the observed data and the data we would expect under the assumption that there is no relationship between the row and column categories of the table (i.e., education and readership, in our example).

R = P - E



The biggest residual, in absolute terms, is -0.045, for Primary completed and Very thorough. That is, the observed proportion of people that only completed primary school and are very thorough is 6.4%, and this is 4.5% lower than the expected proportion of 10.9%, which is computed under the assumption of no relationship between newspaper readership and education. Thus, the tentative conclusion that we can draw from this is that there is a negative association between having completed primary education and reading very thoroughly. That is, people with only primary school education are less likely to read very thoroughly than the average person.


Indexed residuals (I)

Take a look at the top row of the residuals shown in the table above. All of the numbers are close to 0. The obvious explanation for this - that having some primary education is unrelated to reading behavior - is not correct. The real explanation is that all the observed proportions (P) and the expected proportions (E) are small, because only 4.5% of the sample had this level of education. This highlights a problem with looking at residuals from a table. By ignoring the number of people in each of the rows and columns, we end up being most likely to find results only in rows and columns with larger totals (masses). We can solve this problem by dividing the residuals by the expected values, which gives us a table of indexed residuals (I).


I = R / E

The indexed residuals have a straightforward interpretation. The further a value is from 0, the larger the observed proportion relative to the expected proportion. We can now see a clear pattern. The biggest value on the table is the .95 for Some primary and Glance. This tells us that people with some primary education are almost twice as likely to Glance at a newspaper as we would expect if there were no relationship between education and reading. In other words, the observed value is 95% higher than the expected value. Reading along this first row, we see that there is a weaker, but positive, indexed residual of 0.21 for Fairly thorough and Some primary. This tells us that people with some primary education were 21% more likely to be fairly thorough readers than we would expect. And a score of -.65 for Very thorough tells us that people with Some primary education were 65% less likely to be Very thorough readers than expected. Reading through all the numbers on the table, the overall pattern is that higher levels of education equate to more thorough readership.

As we will see later, correspondence analysis is a technique designed for visualizing these indexed values.


Reconstituting indexed residuals from a map

The chart below is a correspondence analysis with the coordinates computed using row principal normalization. I will explain its computation later. Now, I am going to show how we can work backward from this map to the indexed residuals, in much the same way that we can recreate orange juice from orange juice concentrate. Some primary has coordinates of (-.55, -.23) and Glance's coordinates are (-.96, -1.89). We can compute the indexed value by multiplying together the two x coordinates and the two y coordinates and summing them up. Thus we have -.55*-.96 + -.23*-1.89 = .53 + .43 = .96. Taking rounding errors into account, this is identical to the value of .95 shown in the table above.
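This dot product is easy to check in R. Here is a minimal sketch using the rounded coordinates read off the map (the variable names are just illustrative):

some.primary = c(-0.55, -0.23) # row category (principal coordinates)
glance = c(-0.96, -1.89)       # column category (standard coordinates)
sum(some.primary * glance)     # dot product: approximately 0.96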


Indexed residual example

Unless you have studied some linear algebra, there is a good chance that this calculation, known as the dot product (or scalar product or inner product), is not intuitive. Fortunately, it can be computed in a different way that is more intuitive.

To compute the indexed residual for a pair of points, we start by measuring the distance between each of the points and the origin (see the image to the right). In the case of Some primary, the distance is .59. Then, we compute the distance for Glance, which is 2.12. Then we compute the angle formed when we draw lines from each of the points to the origin. This is 41 degrees. Lastly, we multiply together each of these distances with the cosine of the angle. This gives us .59*2.12*cos(41°) = .59*2.12*.75 = .94. Once rounding errors are taken into account, this is the same as the correct value of .95.
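The same calculation can be reproduced in R, reusing the vectors from the sketch above; the small discrepancies are due to rounding:

d.row = sqrt(sum(some.primary^2))                        # distance to origin: about 0.6
d.col = sqrt(sum(glance^2))                              # about 2.12
cos.angle = sum(some.primary * glance) / (d.row * d.col) # about 0.76
acos(cos.angle) * 180 / pi                               # about 40 degrees (41 with the exact coordinates)
d.row * d.col * cos.angle                                # recovers the dot product of about 0.96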

Now, perhaps this new formula looks no simpler than the dot product, but if you look at it a bit closer, it becomes pretty straightforward. The first two parts of the formula are the distance of each point from the origin (i.e., the (0,0) coordinate). Thus, all else being equal, the further a point is from the origin, the stronger the associations between that point and the other points on the map. Looking at the map, we can see that Glance is the most discriminating of the readership (column) categories.

The second part of the interpretation, which will likely bring you back to high school, is the meaning of the cosine. If two points are in exactly the same direction from the origin (i.e., they are on the same line), the cosine of the angle is 1. The bigger the angle, the smaller the cosine, until we get to a right angle (90° or 270°), at which point the cosine is 0. And, when the lines are going in exactly opposite directions (i.e., the line between the two points goes through the origin), the cosine of the angle is -1. So, when there is a small angle between the lines connecting the points to the origin, the association is relatively strong (i.e., a positive indexed residual). When there is a right angle, there is no association (i.e., no residual). And when there is a wide angle, there is a negative residual.

Putting all this together allows us to work out the following things from the row principal correspondence analysis map above, which I have reproduced below to limit scrolling:

  • People with only Primary completed are relatively unlikely to be Very thorough.
  • Those with Some primary are more likely to Glance.
  • People with Primary completed are more likely to be Fairly thorough.
  • The more education somebody has, the more likely they are to be Very thorough.


Reconstituting residuals from bigger tables

If you look at the chart above, you can see that it shows percentages in the x and y labels. (I will describe how these are computed below.) They indicate how much of the variation in the indexed residuals is explained by the horizontal and vertical coordinates. As these add up to 100%, we can perfectly reconstitute the indexed residuals from the map. For most tables, however, they add up to less than 100%. This means that there is some degree of information missing from the map. This is not unlike reconstituted orange juice, which falls short of fresh orange juice.

The post How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think) provides a much more thorough (but un-mathematical) description of issues arising with the interpretation of correspondence analysis.


Singular values, eigenvalues, and variance explained

In the previous two sections, I described the relationship between the coordinates on the map and the indexed residuals. In this section, I am going to explain how the coordinates are computed from the indexed residuals.

The first step in computing the coordinates is to do a near-magical bit of mathematics called a Singular Value Decomposition (SVD). I have had a go at expressing this in layperson's language in my post An Intuitive Explanation of the Singular Value Decomposition (SVD): A Tutorial in R, which works through the same example that I have used in this post.

The code that I used for performing the SVD of the indexed residuals is shown below. The first line computes Z, by multiplying each of the indexed residuals by the square root of its corresponding expected value. This seems a bit mysterious at first, but two interesting things are going on here.

First, Z is a standardized residual, which is a rather cool type of statistic in its own right. Second, and more importantly from the perspective of correspondence analysis, this causes the singular value decomposition to be weighted, such that cells with a higher expected value are given a higher weight in the data. As the expected values are often related to the sample size, this weighting means that smaller cells on the table, for which the sampling error will be larger, are down-weighted. In other words, this weighting makes correspondence analysis relatively robust to outliers caused by sampling error, when the table being analyzed is a contingency table.

Z = I * sqrt(E)
SVD = svd(Z)
rownames(SVD$u) = rownames(P)
rownames(SVD$v) = colnames(P)
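As a quick sanity check, the three outputs of the SVD should reconstruct Z exactly (up to floating-point error):

max(abs(Z - SVD$u %*% diag(SVD$d) %*% t(SVD$v))) # effectively 0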

A singular value decomposition has three outputs:

  • A vector, d, containing the singular values.
  • A matrix, u, containing the left singular vectors.
  • A matrix, v, containing the right singular vectors.

The left singular vectors correspond to the categories in the rows of the table and the right singular vectors correspond to the columns. Each of the singular values, and the corresponding vectors (i.e., columns of u and v), correspond to a dimension. As we will see, the coordinates used to plot row and column categories are derived from the first two dimensions.


Squared singular values are known as eigenvalues. The eigenvalues in our example are .0704, .0129, and .0000.

eigenvalues = SVD$d^2

Each of these eigenvalues is proportional to the amount of variance that the corresponding dimension explains. By summing them up and expressing them as a proportion, which is done by the R function prop.table(eigenvalues), we compute that the first dimension of our correspondence analysis explains 84.5% of the variance in the data and the second 15.5%, which are the numbers shown in the x and y labels of the scatter plot shown earlier. The third dimension explains 0.0% of the variance, so we can ignore it entirely. This is why we are able to perfectly reconstitute the indexed residuals from the correspondence analysis plot.
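Putting this into code:

variance.explained = prop.table(eigenvalues)
round(100 * variance.explained, 1) # 84.5 15.5 0.0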


Standard coordinates

As mentioned, we have weighted the indexed residuals prior to performing the SVD. So, in order to get coordinates that represent the indexed residuals we now need to unweight the SVD's outputs. We do this by dividing each row of the left singular vectors by the square root of the row masses (defined near the beginning of this post):

standard.coordinates.rows = sweep(SVD$u, 1, sqrt(row.masses), "/")

This gives us the standard coordinates of the rows:

We do the same process for the right singular vectors, except we use the column masses:

standard.coordinates.columns = sweep(SVD$v, 1, sqrt(column.masses), "/")

This gives us the standard coordinates of the columns, shown below. These are the coordinates that have been used to plot the column categories on the maps in this post.


Principal coordinates

The principal coordinates are the standard coordinates multiplied by the corresponding singular values:

principal.coordinates.rows = sweep(standard.coordinates.rows, 2, SVD$d, "*")

The positions of the row categories shown on the earlier plots are these principal coordinates. The principal coordinates for the education levels (rows) are shown in the table below.

The principal coordinates represent the distance between the row profiles of the original table. The row profiles are shown in the table below. They are the raw data (N) divided by the row totals. Outside of correspondence analysis, they are more commonly referred to as the row percentages of the contingency table. The more similar two rows' principal coordinates are, the more similar their row profiles. More precisely, when we plot the principal coordinates, the distances between the points are chi-square distances. These are the distances between the rows, weighted based on the column masses. You can find the R calculations for the chi-square distances here.
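The row profiles themselves are a one-line calculation, consistent with the sweep() calls used elsewhere in this post:

row.profiles = sweep(N, 1, rowSums(N), "/") # each row sums to 1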


The principal coordinates for the columns are computed in the same way:

principal.coordinates.columns = sweep(standard.coordinates.columns, 2, SVD$d, "*")

In the row principal plot shown earlier, the row categories' positions are the principal coordinates. The column categories are plotted based on the standard coordinates. This means that it is valid to compare row categories based on their proximity to each other. It is also valid to understand the relationship between the row and column coordinates based on their dot products. But, it is not valid to compare the column points based on their position. I discuss this in more detail in a post called Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R.


Quality

We have already looked at one metric of the quality of a correspondence analysis: the proportion of the variance explained. We can also compute the quality of the correspondence analysis for each of the points on a map. Recall that the further a point is from the origin, the better the correspondence analysis explains that point. When we square the principal coordinates and express these as row proportions, we get measures of the quality of each dimension for each point. Sometimes these are referred to as the squared correlations and squared cosines.

pc = rbind(principal.coordinates.rows, principal.coordinates.columns) 
prop.table(pc ^ 2, 1) 

The quality of the map for a particular category is usually defined as the sum of the scores it gets for the two dimensions that are plotted. In our example, these all add up to 100%.



Acknowledgments

The data in the example comes from Greenacre and Hastie's 1987 paper "The geometric interpretation of correspondence analysis", published in the Journal of the American Statistical Association. 

Where practical, I have used the notation and terminology used in Michael Greenacre's (2016) third edition of Correspondence Analysis in Practice. This excellent book contains many additional calculations for correspondence analysis diagnostics. The only intentional large deviation from Greenacre's terminology relates to the description of the normalizations (I discuss the differences in terminology in Normalization and the Scaling Problem in Correspondence Analysis: A Tutorial Using R).

This post is partly based on a paper that I wrote for the International Journal of Market Research, "Improving the display of correspondence analysis using moon plots", in 2011.


TRY IT OUT

You can sign-in to Displayr and view the document that contains all the R calculations in this post.



]]>
https://www.displayr.com/math-correspondence-analysis/feed/ 0
Correspondence Analysis of Square Tables https://www.displayr.com/correspondence-analysis-of-brand-switching-and-other-square-tables/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/correspondence-analysis-of-brand-switching-and-other-square-tables/#respond Mon, 24 Jul 2017 17:52:51 +0000 https://www.displayr.com/?p=2346 ...]]> Square tables are data tables where the rows and columns have the same labels, commonly seen as a crosstab of brand switching or brand repertoire data. Correspondence analysis is often used to visualize these tables as a much simpler chart. In this post I discuss the special case of square tables, using examples of cereal brand-switching, and switching between professions.

As background, this earlier post describes what correspondence analysis is. This post describes how correspondence analysis works and how to interpret the results.


Correspondence Analysis of Square Tables

A typical table used for correspondence analysis shows the responses to one question along the rows and responses to another question along the columns. Each question has its own set of mutually exclusive categorical responses. The cell at the intersection of any row and column contains the count of cases with that combination of row and column responses. I say "typical table" above because there are other use cases, such as tables of time series, raw data, and means, all of which are described in this post which describes when to use correspondence analysis.

In general, the sets of responses labeling the rows and columns are different. For example, the rows may be labeled by each respondent's favorite color and the columns by their favorite sport. If, instead, we labeled the columns by their partner's favorite color, then we would have an example of a square table.

A square table, in this case, does not just have the same number of rows as columns. The rows and columns have identical labels, and they are presented in the same order. Such tables may also be called switching matrices, transition tables or confusion matrices.
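A quick way to confirm a table meets this definition in R (a sketch, assuming the table is stored as a matrix N with named rows and columns):

nrow(N) == ncol(N) && identical(rownames(N), colnames(N))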

Below I show an example as a heatmap for easier visualization. The data relate to brand switching between breakfast cereals. The rows contain the first cereal purchased, the columns contain the next cereal purchased.



Data symmetry

Before delving into the correspondence analysis, let's take a look at the data above. One of the first observations that we can make about it is the strong diagonal. On the whole, people tend to buy the same cereal repeatedly.

Looking away from the diagonal, there is also high symmetry. For example, the number of people switching from Cornflakes to Rice Krispies (80) is almost the same as the number switching in the other direction (81). Both of these observations are quite typical of square tables from consumer data.

Now let's perform the correspondence analysis. The scatterplot below shows the first 2 output dimensions.



Interpretation of Square Correspondence Analysis

It's tempting to draw immediate conclusions from the plot above. Before we do so, we need to take note of a few things.

First, any square matrix can be broken down into symmetric and skew-symmetric components. The correspondence analysis of those two components is driven by different aspects of the data, and they are best analyzed separately.

  • The symmetric component shows us how much 2-way exchange occurs between categories. Points that are close together have a relatively high rate of exchange between them.
  • The skew-symmetric component determines the net flow into or out of a category. Points that are close together have similar net flows with the other categories.

We can tell which dimensions are symmetric and which are skew-symmetric by inspecting how much variance each dimension explains. The symmetric component produces dimensions that each explain a different amount of the variation in the table. In more technical language, the eigenvalues, inertias or canonical correlations are unique. Correspondence analysis of the skew-symmetric component produces dimensions that occur in pairs. Both dimensions within a pair explain the same amount of variation.
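The decomposition itself is straightforward in R. A minimal sketch, assuming the switching data is stored as a square matrix N:

S = (N + t(N)) / 2 # symmetric component (two-way exchange)
K = (N - t(N)) / 2 # skew-symmetric component (net flows)
# N is recovered exactly as S + K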

Second, always take note of the amount of variance explained by each dimension. When the total explained by both dimensions of a chart is much less than 100%, the unseen dimensions contain a significant amount of information.

Third, points further from the origin are more strongly differentiated. Conversely, points that are close to the origin are less distinct and may not be similar (other than their mutual lack of distinction!).

Finally, this post covers interpretation of correspondence analysis in much more detail.


Cereal interpretation

To understand the cereal correspondence analysis, let's look at the variance explained by each dimension:


We see that dimensions 1 to 6 have unique amounts of variance explained, so they are symmetric. Taking a closer look at the raw output below, dimensions 7 to 12 occur in pairs, so they are a result of the skew-symmetric component.


Since the earlier scatter plot showed the first two dimensions, we can now say that they are symmetric. This means that there is relatively little switching to or from Shredded Wheat. Frosties and Crunchy Nut Cornflakes form a pair, indicating a relatively high level of switching between those brands. The other 4 brands also form a loose group of mutual interchange. However, these two dimensions only account for 62% of the variance, so they do not tell us everything about the data.

The fact that the first 6 dimensions result from the symmetric component confirms our earlier observation about the symmetry of the data. In fact, 99.5% of the variance is due to symmetry. It is not unusual that the symmetric component is dominant. In this case, it would be unwise to plot the skew-symmetric dimensions since they represent such a tiny part of the information. I would also never plot a symmetric and skew-symmetric dimension on the same chart.


Less Symmetric Data

As a second example, I am using data about how people transition between jobs. This is a somewhat mature data set, referring to German politicians in the 1840s. You will not find software engineer listed. The rows of the table tell us the professions held by the politicians prior to their terms, and the columns tell us the professions that they held after they left office.


The plot below shows the first two symmetric dimensions.


From this, we conclude that there is a relatively high exchange between Justice, Administration, and Lawyer. There is also a high exchange between Education and Self-employed.

The skew-symmetric component is 15% of the variance, which is much more than for the cereal data but still a small part of the whole. On the chart below we see that Lawyer and Justice are at the extremities. This means that those professions experience a relatively high net inflow and outflow.


We cannot say from the chart which has the inflow and which has the outflow. The only way to tell is to look at the raw data. To clarify this point, we can compute the net inflow for each profession by working out the difference between its column total and row total in the original table. The final column chart shows us that Lawyer has the inflow, and Justice has the outflow.
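In R, this check is a one-liner. A sketch, assuming N holds the table with prior professions in the rows and subsequent professions in the columns:

net.inflow = colSums(N) - rowSums(N) # positive for Lawyer, negative for Justice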



Conclusion

One key advantage of using correspondence analysis specifically for a square table is that we do not need to plot row and column labels separately. This means that we can interpret the closeness of points on the same scale. However, as with all correspondence analysis, we need to take care to draw correct conclusions. In particular, the symmetric and skew-symmetric components should be analyzed independently. The symmetric parts tell us about the exchange between different categories, while the skew-symmetric parts tell us about net flows into or out of categories.



TRY IT OUT
All the analysis in this post was conducted in Displayr. You can review the underlying code or run your own analysis by clicking through to this brand switching example. The flipDimensionReduction package (available on GitHub) was used, which itself uses the ca package for correspondence analysis. Check out Q for fast, advanced survey analysis and crosstab software.



The cereal data is from Dawes, John (2007), "The Structure of Switching: An Examination of Market Structure Across Brands and Brand Variants." The historical German politician data is from Greenacre, Michael, "Correspondence Analysis of Square Asymmetric Matrices," Journal of the Royal Statistical Society, Series C (Applied Statistics) 49, no. 3 (2000): 297-310.

]]>
https://www.displayr.com/correspondence-analysis-of-brand-switching-and-other-square-tables/feed/ 0
Customization of Bubble Charts for Correspondence Analysis in Displayr https://www.displayr.com/advanced-customization-of-bubble-charts-for-correspondence-analysis-in-displayr/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/advanced-customization-of-bubble-charts-for-correspondence-analysis-in-displayr/#respond Sat, 08 Jul 2017 01:31:19 +0000 https://www.displayr.com/?p=2371 ...]]> When you insert a bubble chart in Displayr (Insert > Visualization > Bubbleplot), you can customize some aspects of its appearance from the controls that appear in the object inspector on the right of the screen. More advanced customizations can be performed by instead inserting an R Output (Insert > R Output), and writing code. I illustrate this by explaining how I created the visualizations in my post Using Bubble Charts to Show Significant Relationships and Residuals in Correspondence Analysis, shown below.

Create your Bubble Plot

The visualization above appears at the end of that post. It is created by a quite lengthy chunk of code. Fortunately, you do not need to understand all of it! In this post I walk through some of the key steps of customizing bubble charts by modifying this code.


Hooking up the code (not as scary as it looks)

The code below creates a correspondence analysis, and then presents this using a bubble chart. To reproduce a similar visualization with your own data:

  1. Create a table in Displayr that contains the data you want to analyze. This is no different to when you would normally do correspondence analysis.
  2. Select the table and you can see the Name of the table in the Object Inspector (Properties > GENERAL). When I did this, the name of my table was table.Q9.
  3. Click on the page containing the table in the list of Pages (far-left of the screen), and select Home > Duplicate, which will create a new page that contains the same table again.
  4. Click on the table on the new page, select Object Inspector (Inputs > STATISTICS > Statistics - Cells) and choose z-Statistic. Repeat this process to de-select %.
  5. Click on the table and change the name of the table in Object Inspector (Properties > GENERAL > Name) to table.zScores (or anything else you want).
  6. Select Insert > R Output and paste in the code below, modifying the first 12 lines as per your needs. In the first line, replace table.Q9 with the name of your table (see step 2). In the 3rd line, replace Egypt with the name of the row that contains the standardized residuals that you wish to use, filling in the other rows with the labels that you wish to have appear on the final visualization.
 x = table.Q9
z = table.zScores
row.to.use = "Egypt"
row.label = "Country"
column.label = "Concern"
title = "Traveler's concerns about different countries (bubbles relate to Egypt)"
legend.title = "Strength of relationship"
# Removing rows and columns to be ignored
remove = c("NET", "Total")
x = x[!rownames(x) %in% remove, !colnames(x) %in% remove]
z = z[row.to.use, !colnames(z) %in% remove]
colnames(x) = paste0(colnames(x), ": ", round(x[row.to.use,]), "%")
# Default circle size (this is relative to the z-scores)
z[abs(z) <= 1.96] <- 0 # Zeroes out non-significant residuals; comment this line out to turn off significance
default.size = 0.1 # Minimum circle size
my.ca = ca::ca(x)
coords = flipDimensionReduction::CANormalization(my.ca, "Principal")
n.rows = nrow(coords$row.coordinates)
n.columns = nrow(coords$column.coordinates)
coords = rbind(coords$row.coordinates, coords$column.coordinates)
# Creating the 'group' variable
n = n.rows + n.columns
groups <- rep("No association", n.columns) 

groups[z > 0] = paste0("Weakness of ", row.to.use)
groups[z < 0] = paste0("Strength of ", row.to.use) 
groups <- c(rep(row.label, n.rows), groups)
# Setting bubble size
bubble.size <- c(rep(default.size, n.rows), abs(z))
# Labeling the dimensions
singular.values <- round(my.ca$sv^2, 6)
variance.explained <- paste(as.character(round(100 * prop.table(singular.values), 1)), "%", sep = "")[c(1, 2)]
column.labels <- paste("Dimension", c(1, 2), paste0("(", variance.explained, ")"))
bubble.size[bubble.size < default.size] <- default.size
rhtmlLabeledScatter::LabeledScatter(X = coords[, 1],
Y = coords[, 2],
Z = bubble.size,
label = rownames(coords),
label.alt = rownames(coords),
group = groups,
colors = c("Black", "Purple", "#FA614B", "#3E7DCC"),
fixed.aspect = TRUE,
title = title,
x.title = column.labels[1],
y.title = column.labels[2],
z.title = legend.title,
axis.font.size = 10,
labels.font.size = 14,
title.font.size = 20,
legend.font.size = 15,
y.title.font.size = 16,
x.title.font.size = 16) 

Turning off the significance testing

The visualization below is the same as the one above, except that the significance testing has been turned off. This was achieved by:

  • Commenting out line 14 (i.e., typing a # at the very beginning of the line, which prevents that line of code being run).
  • Removing , "purple" from line 40 and swapping around the order of the two last colors ( "#3E7DCC", "#FA614B"). This is where you customize the colors. You can type in a color code, or a color name, such as "Red" or "Blue".

Only showing the positive residuals

The next plot shows only the positive residuals (i.e., the concerns about Egypt that have the strongest relationship). It was created by:

  • Removing the three letters abs from line 28.
  • Commenting out line 25.
  • In line 40, replacing #3E7DCC with Purple

Taking the data values off the chart

Lastly, to remove the percentages from the visualization, comment out line 12, which leaves us with the visualization below.


More advanced customizations

If you hover your mouse over the word LabeledScatter in Properties > R CODE (line 34), a tooltip shows all the definitions of the parameters in this function, which allow further customization to be performed.

Create your Bubble Plot

]]>
https://www.displayr.com/advanced-customization-of-bubble-charts-for-correspondence-analysis-in-displayr/feed/ 0
Using Bubble Charts to Show Significant Relationships and Residuals in Correspondence Analysis https://www.displayr.com/bubble-charts-significant-correspondence-analysis/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/bubble-charts-significant-correspondence-analysis/#respond Sat, 08 Jul 2017 00:20:57 +0000 https://www.displayr.com/?p=2172 ...]]> While correspondence analysis does a great job at highlighting relationships in large tables, a practical problem is that correspondence analysis only shows the strongest relationships, and sometimes some of the weaker relationships may be of more interest. One of our users (thanks Katie at JWT!) suggested a solution to this: format the chart to highlight key aspects in the data (e.g., standardized residuals).

If you want to create your own correspondence analysis, you can get started using this handy template.

Case study: travelers' concerns about Egypt

The table below shows American travelers' concerns about different countries (I have analyzed this before in my Palm Trees post). There is too much going on with this table for it to be easy to understand. I have used arrows and colors to highlight interesting patterns based on the standardized residuals, but too many things are highlighted for this to be particularly helpful. This is the classic type of table where correspondence analysis is perfect.


The correspondence analysis of the data is shown below. The two dimensions explain 93% of the variance, which tells us that the map shows the main relationships. However, the map is not doing a good job of explaining the relationships between Egypt and China and the concerns of travelers. Both countries are close to the center of the map. Adding more information to the visualization can enhance it further. In the rest of the post I focus on improving the view of Egypt.

Plotting positive standardized residuals

The standardized residuals are shown below. Remembering that positive numbers indicate a positive correlation between the row and column categories, we can see that there are a few "positive" relationships for Egypt, with Safety being the strongest relationship. As the data is about travelers' concerns, a positive residual indicates a negative issue for Egypt.

Bubbles represent the positive standardized residuals in the plot below. The area of the bubble reveals the strength of the association of the concern with Egypt. This is a lot easier to digest than the residuals. We can easily see that "Safety" stands out as the greatest concern. "Not being understood" and "Friendliness", the next most important issues, appear trivial relative to "Safety".

Adding the raw data to the chart

A limitation of plotting standardized residuals is that they show the strength of association, which can be misinterpreted if there are attributes in the analysis that are widely held or obscure. A simple remedy is to plot the raw data for the brand of interest in the labels. This clears up a likely misinterpretation encouraged by all the earlier charts. You could interpret the previous visualizations as implying a lack of relationship between "Cost" and Egypt. However, 44% of people evidently show concern about the cost of visiting Egypt. There is no positive correlation, though, because people are much more concerned about the costs of visiting the European countries (you can see this by looking at the original data table, earlier in the post).


Showing positive and negative relationships

The following visualization also shows the negative standardized residuals, drawing the circles in proportion to their absolute values. Blue represents the negative residuals, and the pink color the positive ones. In a more common application, where the correspondence analysis is of positive brand associations, reversing this color-coding would be appropriate.


Showing only significant relationships

The final visualization below shows only the significant associations with Egypt. I think it is the best of the visualizations in this post! If you are wanting to understand the data as it relates to Egypt, this is much more compelling than the original data. We can quickly see that "Cost" represents a comparative advantage, and that Egypt shares its main weaknesses with Mexico.  If you want to encourage visitors to Egypt, then you could consider positioning it as a competitor to Mexico. (This data comes from a survey done in 2012, and thus potentially constitutes a poor guide to the market's mind today.)


Software

To see the underlying R code used to create the visualizations in this post, click here, login to Displayr and open the relevant document. You can click on any of the visualizations in Displayr and select Properties > R CODE in the Object Inspector to see the underlying code.

I have also written other posts that describe how to create these visualizations and the differences in the R code between the plots. One of them describes how to create these visualizations in Displayr, and another describes how to do it in Q.

]]>
https://www.displayr.com/bubble-charts-significant-correspondence-analysis/feed/ 0
When to Use, and Not Use, Correspondence Analysis https://www.displayr.com/use-not-use-correspondence-analysis/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/use-not-use-correspondence-analysis/#respond Mon, 22 May 2017 17:04:17 +0000 https://www.displayr.com/?p=1685 ...]]> Correspondence analysis is one of those rare data science tools which make things simpler. You start with a big table that is too hard to read, and end with a relatively simple visualization. In this post I explain how you can work out if a table is suitable for correspondence analysis.

Correspondence analysis is useful when you have a table with at least two rows and two columns, no missing data, no negative values, and all the data has the same scale. The only hard bit of this to understand is "same scale", which is the focus of the examples here.

Contingency tables (OK)

The classic application for correspondence analysis is the analysis of contingency tables. A contingency table is a crosstab where the row categories are mutually exclusive and the column categories are also mutually exclusive. When your data looks like this, correspondence analysis is usually going to do the job.

In the example below I almost show a contingency table. I say almost, because I have included the row and column totals (labeled as NET). If I were to run correspondence analysis on this table, it would not be valid, because the totals are on a different scale from the rest of the data.

There is a simple way to understand this problem. Does the table cease to make sense if it is sorted by any of its rows or columns? Consider the Coca-Cola row. Sorting by this row would move the NET to the beginning of the row. Would it add insight? No, it would not. It would probably just create confusion. Fortunately, most data science apps are smart enough to leave the row and column totals out of correspondence analysis, so I will not talk about this trivial case again. Once the totals are removed, this table is perfect for correspondence analysis.

We could also conduct a correspondence analysis if we instead showed row percentages, column percentages, or index values. However, each will give a different output, as each transformation emphasizes different aspects of the data, and the resulting correspondence analysis reflects that emphasis.


Square tables (OK)

The table below shows a special type of table where the rows and columns have the same labels. This table is showing car choices, where the rows represent the cars previously owned, and the columns represent the cars currently owned by a sample of buyers. Such tables have various names, such as switching matrices, transition tables, and confusion matrices.  They can all be analyzed using correspondence analysis, but there is a (small) benefit in using a special variant of correspondence analysis designed for such square tables.

The plot below has been created using the special variant of correspondence analysis designed for square tables. The chief practical benefit of this method is that we don't plot both the column and the row labels, and can thus interpret relationships by looking at how closely together the labels appear. In this example, Porsche is dominating the analysis. Looking at the raw data, we can see that it has a small sample size. We have three options:

  • We can simply remove it from the plot (you can do this by dragging and dropping). If you do this, it does not re-estimate the map. Rather, it still includes Porsche in the analysis, so this is probably not the ideal solution.
  • Merge similar brands and re-run the analysis.
  • Re-run the analysis with the brand removed. This is generally my preferred option.


Tables with multiple statistics (bad)

The table below shows both counts and column percentages. The data here is clearly on two different scales, making correspondence analysis inappropriate. We could scale the counts by turning them into percentages, but then we would just have the same data twice, which would be pointless.

Multiple tables spliced together (bad, unless scaled)

The table below, which shows cola preference by age and gender, would not be great for correspondence analysis. Why? The problem is that the data is not all on the same scale. There is an easy way to see this. If the data is all the same scale, it means that it is meaningful to sort the table by any of its rows and columns. If we were to sort this table by the first row, we would get Male and Female appearing first, because they have larger base sizes, and not because the sorting would be meaningful.


This gives us some insight into how to fix the problem. We need to transform the data in some way so that it is all on the same scale. We can achieve this by dividing each number by the column total (the NET at the bottom of the table), which gives us the table below. With this table, it makes sense to sort by the first row (which reveals that Coca-Cola preferences differ much more widely by age than by gender). It would also be appropriate to sort the table by any of the columns.
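This rescaling is a one-line operation in R. A minimal sketch, assuming the spliced table (with the NET row already removed) is stored as x:

x.scaled = sweep(x, 2, colSums(x), "/") # every column now sums to 1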



The next example has also been constructed using two tables. The last column shows the average attitude score for each brand, collected on a 5-point scale. Hopefully it is easy to see that the data in this table is not on the same scale, making it inappropriate for correspondence analysis.


The next table shows the same data again, but "fixed", so that it adds up to 100. Is this table OK? No. The best way to appreciate the problem is to focus on Diet Pepsi. Diet Pepsi has the lowest score of any of the brands on Attitude. However, if we read across the Diet Pepsi row, we see that Diet Pepsi ends up getting its "best" score for Attitude. If you were to apply correspondence analysis, it would tell you that Diet Pepsi "owns" Attitude.


Multiple response tables and grids (OK)

Most tables that show multiple response data can also be used with correspondence analysis. The table below, which is referred to as a brand association grid by market researchers, is made up of the data from 63 different variables. Each of the 800 respondents in the data set has indicated which brands possess which attributes. As the data is non-negative, and is all on the same scale, it is a prime candidate for correspondence analysis.

Tables of means (OK)

The table below shows averages. It meets all the requirements for correspondence analysis. (Although, as there are only two rows of data, the resulting map will show all the data points organized along a straight line, which can cause a bit of a panic for you if you are not expecting it.)


Correlations (usually bad)

The next table shows correlations between two sets of variables. Correspondence analysis will not work here as we have negative values.


The next table is the same as the previous one, except that I have added 1 to every number. This means that there are no longer negative results. This data now meets the requirements for correspondence analysis. Sure, there are perhaps better techniques, such as canonical correlation analysis, but we can extract insights from this table using correspondence analysis.


Raw data (usually OK)

The table below shows the raw data that we used to compute the correlations above. Each row represents a person. Each column indicates the consumption of the different brands either at home or out and about. Is it OK? Raw data is usually OK, provided that the data is either binary or numeric. However, raw data for unordered categorical variables (e.g., occupation, brand preference) will not work, as the data has no meaningful scaling (i.e., averages do not make sense). In this specific case, we have a few rows with no data, and that causes a problem. But, once they are removed, we can get a useful map, as shown below.


The cool thing about using raw data is that we can understand the distribution of respondents in the data. The hard thing, though, is that all the usual rules of interpretation apply, so these plots can be quite difficult to interpret correctly.

Create your own Correspondence Analysis

Time series data (e.g., sales data)

The final example shows sales by different retailer categories by month. While you may not think of sales data as being appropriate for correspondence analysis, it satisfies all the criteria.

The visualization below is the correspondence analysis of the sales data. It shows that department store and clothes/accessory sales are strongly associated with December.


Summary

Correspondence analysis is a powerful tool for simplifying tables. Provided that your data is appropriate - at least two rows and two columns, no negative values, and a consistent scale - it can do the job.

If you have the data, but are unsure where to start, you can create your own correspondence analysis using the template below!

]]>
https://www.displayr.com/use-not-use-correspondence-analysis/feed/ 0
Correspondence Analysis Versus Multiple Correspondence Analysis: Which to Use and When? https://www.displayr.com/correspondence-analysis-versus-multiple-correspondence-analysis-use/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/correspondence-analysis-versus-multiple-correspondence-analysis-use/#respond Mon, 22 May 2017 12:27:21 +0000 https://www.displayr.com/?p=1679

In this post I explain the difference between the two techniques, and their relative strengths and weaknesses. I assume that you already are familiar with correspondence analysis, but if not, then consider first reading How correspondence analysis works (a simple explanation).

The core difference between correspondence analysis and multiple correspondence analysis

The name, rather obviously, suggests that multiple correspondence analysis should be better than correspondence analysis. Multiple = more = better. And this gets to the root of the problem: the word "multiple" is misleading, except in an exotic technical case, which I will return to below.

For most purposes the way to think about the two techniques is this:

  • Multiple correspondence analysis is a technique for analyzing categorical variables. It is essentially a form of factor analysis for categorical data. You should use it when you want a general understanding of how categorical variables are related.
  • Correspondence analysis is a technique for summarizing relativities in tables. As tables are ubiquitous in data analysis, it is a technique that can be used widely.

Both techniques give the same answer when you have two variables. You can also use both of them for more than two variables, but they give different answers. I illustrate this below.

The reason for the word "multiple" is that multiple correspondence can be applied to a table that has more than two dimensions (e.g., a cube), whereas correspondence analysis requires as an input a table with only two dimensions. So, the word "multiple" refers to the number of dimensions of the input table. Below I show you a five dimensional table so you can get a better idea of what this means.

This analysis was done in Displayr. To see Displayr in action, grab a demo here.

Book a demo

An example of multiple correspondence analysis

The scatterplot below shows a multiple correspondence analysis of five variables: voting in the 2008 and 2012 US elections, approval of President Trump, age, and gender. The key conclusions from it are that:

  • People aged 18 to 24 were less likely to vote and more likely to have no opinion about Trump.
  • Approval and disapproval of Trump are correlated with candidate-party choice in 2012 and 2008.


The problems with multiple correspondence analysis

Difficulty in checking the input data table

The above plot seems pretty useful. And, multiple correspondence analysis can be useful. Nevertheless, it has some serious limitations. The first limitation is that it is extremely difficult to check conclusions by looking at the raw data. Check out the table below. Note the scroll bars: it is a really big table. It is a five-dimensional table. It shows counts, as it is difficult to even think about how to compute percentages on a five-dimensional table.



Inability to confidently evaluate associations

The next limitation relates to the interpretation of the relationships between the variables. As discussed in How to interpret correspondence analysis plots (it probably isn’t the way you think), if we have the appropriate normalization, when we use correspondence analysis we can understand the association between labels from different variables by drawing a line from each label to the origin, and taking into account the lengths of these lines and the angle where they intersect with the origin. Unfortunately, with multiple correspondence analysis, there is no normalization that permits all such comparisons. Consequently, we always need to check the raw data. But, as just discussed, that is not so easy. Hence, with multiple correspondence analysis we have an increased risk of misinterpretation.

Further complicating the problem with looking at associations is that multiple correspondence analysis tends not to explain all the variance. In the map shown above, 16% of the variance is not explained. As we cannot inspect the data, this is a problem.

Messiness

The next problem relates to messiness. With more than five or six variables, the resulting maps are really hard to use. As an example, take a look at the one below. What makes it so hard is that it plots every level of every categorical variable. Often this means that redundant information is plotted (e.g., the "yes" of a two-category variable and, at the opposite side of the map, the "no" for the same variable). As I will show you soon, we can get a good visualization of this data using correspondence analysis.



Unfocused

Last, multiple correspondence analysis produces unfocused analyses. What do I mean by this? In the analysis above, we are looking at the relationships among 17 variables: 16 relating to traits that people want in a US president, plus age. The analysis treats all of the variables as being equally important. It will show the strongest relationships. That means that we will end up with a plot that explains how preference for the traits relates to age if and only if there is a very strong relationship between preferences for these characteristics and age.


Correspondence analysis with multiple variables

The messy plot above represents a multi-dimensional table that has 17 different dimensions. Sixteen of the dimensions have two levels each (i.e., whether or not a person mentioned a trait). The final dimension, age, has six levels. Thus, the input table has 2^16 * 6 = 393,216 cells! Far too many to read. The table below uses the same 17 variables, but only has 96 cells. This is because rather than each of the trait variables being an extra dimension with two levels, we are just showing one of the levels and "stacking" them on top of each other as rows in the table. For example, 69% of people aged 18 to 24 said they wished for an American President that was Decent/Ethical, and 33% of people in this age band wanted the President to be Plain-speaking.

The multiple correspondence analysis shown in the previous section was based on the same data. However, it was based on the underlying variables. With correspondence analysis we first need to create a table. This is really what makes correspondence analysis so useful. We get to create a table in such a way that we focus on what we want to know. As this table compares age by the 16 variables, it will produce a plot that highlights the key relationships between age and the importance of these characteristics.
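If you want to build such a stacked table yourself, here is a minimal sketch in R; the traits data frame and age factor are hypothetical stand-ins for the actual survey variables:

# traits: data frame of 16 binary (0/1) trait indicators
# age: factor with six age bands
stacked = sapply(split(traits, age), colMeans) * 100
round(stacked) # 16 trait rows x 6 age columns = 96 cells of percentages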


The resulting plot (shown below) is a lot less messy. It tells us some things that are on face-value surprising. For example, that the 25 to 34 year olds are not so interested in a Christian president. As the analysis is based on a table, we can confirm this conclusion.


Summary

Although multiple correspondence analysis sounds better than correspondence analysis, the truth is the other way around. Multiple correspondence analysis is an obscure technique that can be useful in special circumstances. Correspondence analysis is applicable to the analysis of many different types of tables. As most data appears in a table at one time or another, correspondence analysis is a technique that can be widely applied.

The examples used in this post have all been created in Displayr. You can create your own correspondence analysis in Displayr for free by using the template below!

]]>
https://www.displayr.com/correspondence-analysis-versus-multiple-correspondence-analysis-use/feed/ 0
How Correspondence Analysis Works (A Simple Explanation) https://www.displayr.com/how-correspondence-analysis-works/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-correspondence-analysis-works/#respond Fri, 19 May 2017 07:37:12 +0000 https://www.displayr.com/?p=1689 ...]]> Correspondence analysis is a data science tool for summarizing tables. This post explains the basics of how it works. It focuses on how to understand the underlying logic without entering into an explanation of the actual math.

A simple example

The table below shows some data on the traits of some animals, with the resulting correspondence analysis map below. This post explains, in simple terms, how the map is computed from the table.


Step 1: Compute row and column averages

In the first step, compute the averages for each row and column, as shown below.

Step 2: Compute the expected values

Next, for each cell, compute what are known in the trade as the expected values. Each cell's expected value is the row average for that cell, multiplied by the column average, and divided by the overall average. So, looking at Big and Dog, we have 35 * 61 / 51 = 42. The following table shows all the expected values.

Step 3: Compute the residuals

The residuals are computed by subtracting the expected values from the original data. Thus, for Dog and Big, the residual is 80 - 42 = 38. The residuals are shown below. These residuals are at the heart of correspondence analysis, so do not skip to the next step until you are really sure you get what they mean.
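As a quick check in R, using the rounded averages quoted above:

expected = 35 * 61 / 51 # approximately 42
residual = 80 - expected # approximately 38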

The residuals show the associations between the row and column labels. Big positive numbers means a strong positive relationship. The opposite is true for negatives. Let us look at the residuals for Dog. We can see that its biggest score is for Friendly. And, its lowest score is for Resourceful. If you look at the original data table at the top of the post, neither of these conclusions should surprise you.

The interesting result in the first row is Animal, which, for Dog, sits at 100 in the original data. But the residual is only 5, indicating virtually no association between being an animal and being a dog. Why? All rows of the data are animals (and four, like the Dog, are mammals). So, while a Dog is an animal, like all the other things in the analysis, this association becomes very weak, which is what is reflected in the residuals.

Step 4: Plotting labels with similar residuals close together

Compare the residuals for Cat with those for Dog. While the Dog residuals are generally larger, most are in the same direction. If you take the time, you will realize that in terms of residuals, Dog and Cat are most similar. The next most similar is Dog and Wallaby. Then comes Rat. Last, the Cockroach is least like the Dog. Now look at the blue labels in the plot below, which represent the rows of the table. The relative position of the other animals from Dog in the visualization is consistent with the similarities of their respective residuals.

Now look at the variance shown in the axis labels of the chart. The horizontal dimension explains 89% of the variance in the data whereas the vertical dimension explains only 8%. You can infer the relative amount explained by each dimension on a well-drawn map. That is, we can see on this map that the points vary much more on the horizontal than on the vertical, and this is why the variance explained differs so greatly between the two dimensions.

Together, these two dimensions explain 97% of the variance. This, in turn, tells us that the map represents almost all of the information in the residuals, which is good news. If, instead, they explained a relatively small amount, the map would not tell us the complete story.

Now look at the columns. Big and Friendly are almost equally large, which is why they are next to each other on the map. The least similar trait to Big is Resourceful, which is why it is on the other side of the map to Big.

Step 5: Interpreting the relationship between row and column labels

Now we come to the tricky bit. Correspondence analysis places the row labels on the plot such that the closer two rows (animals) are to each other, the more similar their residuals. This also applies to the column (traits) labels. Most people then conclude that the greater the proximity between a row label and a column label, the higher the residual and association. Wrong. If you think about it for a bit, you may realize that it is impossible to create a map with such an interpretation (and good careers have been tarnished in the effort to do it).

To better understand this, compare Dog and Big with Wallaby and Lucky. Dog and Big are close together. Lucky and Wallaby are almost identically proximate. Recall also that the residual for Dog and Big is very high, at 38. Because of this, as we might expect, they are close together on the map. Nevertheless, the residual for Wallaby and Lucky is only 2, yet they are even closer together on the map than Dog and Big. What is going on here?

Now, take a look at Cockroach. Its residual for Athletic is high at 42. As this is bigger than the 38 for Dog and Big, intuitively you would want Cockroach and Athletic to be very close together on the map. But, Cockroach has an even bigger residual of 61 for Resourceful, and if we put Cockroach and Athletic next to each other, where can we put Resourceful? There is, in fact, no way to position the labels to sensibly communicate these residuals.

Fortunately, all is not lost. The way that correspondence analysis works means that we can compare row labels to each other based on distances. We can also compare column labels to each other based on distances. However, if we want to compare a row label to a column label, we need to do three things (sketched in code after the list):

  1. Look at the length of the line connecting the row label to the origin. Longer lines indicate that the row label is highly associated with some of the column labels (i.e., it has at least one high residual).
  2. Look at the length of the line connecting the column label to the origin. Longer lines again indicate a high association between the column label and one or more row labels.
  3. Look at the angle formed between these two lines. Really small angles indicate association. 90 degree angles indicate no relationship. Angles near 180 degrees indicate negative associations.
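These three rules can be reduced to arithmetic. Here is a minimal sketch, where row.pt and col.pt are hypothetical two-dimensional coordinates read off a map, not output from any particular package:

# The three rules, as arithmetic on two map points
association.cue <- function(row.pt, col.pt) {
  len.row <- sqrt(sum(row.pt^2))  # rule 1: length of the row arrow
  len.col <- sqrt(sum(col.pt^2))  # rule 2: length of the column arrow
  cos.angle <- sum(row.pt * col.pt) / (len.row * len.col)
  angle <- acos(cos.angle) * 180 / pi  # rule 3: small angle = association
  c(row.length = len.row, col.length = len.col, angle.degrees = angle)
}
association.cue(c(0.8, 0.3), c(0.6, 0.1))  # long arrows, small angle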

Wallaby and Lucky correspondence

Let us work through these rules using some examples. Look at Wallaby and Lucky to the right. The angle is about 30 degrees, indicating some form of association. The short lines, however, suggest that the correct interpretation is that there is either no association, or a very weak one.

Cockroach and Athletic correspondence

The plot for Cockroach and Athletic is reproduced to the left. The angle is very small, suggesting an association. The arrows are both, in relative terms, long, suggesting a strong association. As the arrow to Resourceful would be even longer, and the angle marginally smaller, this tells us that Cockroach is even more strongly associated with Resourceful than with Athletic.

I return to this example, and add a whole lot more examples of interpretation, in How to interpret correspondence analysis plots (it probably isn't the way you think).


How to Interpret Correspondence Analysis Plots (It Probably Isn’t the Way You Think)
https://www.displayr.com/interpret-correspondence-analysis-plots-probably-isnt-way-think/

Correspondence analysis is a popular data science technique. It takes a large table and turns it into a seemingly easy-to-read visualization. Unfortunately, it is not quite as easy to read as most people assume.

In How correspondence analysis works (a simple explanation), I provide a basic explanation of how to interpret correspondence analysis, so if you are completely new to the field, please read that post first. In this post I provide lots of examples to illustrate some of the more complex issues.

1. Check conclusions using the raw data

The key to correctly interpreting correspondence analysis is to check any important conclusions by referring back to the original data. In this post I list 9 other things to think about when interpreting correspondence analysis. But, so long as you always remember this first rule, you will not go wrong.

The reason for this rule is illustrated in the example below. It shows 24 months of sales data by different retail categories. The visualization shows that Department stores are associated with December (i.e., Christmas, Dec-15 and Dec-16). We can see that Food retailing is on the opposite side of the map, which most people would interpret as meaning that Food retailing sales are lower in December.


Now, take a look at the actual data, shown below. Even though Food retailing is a long way from December on the map:

  1. Food retailing has the highest sales in December of any of the categories.
  2. Food retailing's biggest month is December.

How can this be? The data seems to say the exact opposite of the visualization. If you have read How correspondence analysis works (a simple explanation), you should understand that this is because correspondence analysis is all about relativities. If we dig deeper into the data, we can see that the map above does make sense, once you know how to read it.

While Food retailing does peak at Christmas, its December sales are only 19% above its average monthly sales. By contrast, Department store sales spike to 85% above average in December. This is what correspondence analysis is trying to show us. Correspondence analysis does not show us which rows or which columns have the highest numbers. It shows us the relativities. If your interest is in which categories sell the most, or how sales change over time, you are better off plotting the raw data than using correspondence analysis.
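Here is the relativity in miniature, with hypothetical monthly sales (not the actual retail figures): what matters to correspondence analysis is each category's December relative to its own average month.

# Hypothetical monthly sales for two categories (Jan to Dec)
food <- c(rep(100, 11), 121)  # big seller all year, mild December bump
dept <- c(rep(40, 11), 80)    # smaller seller, huge December spike
round(100 * (food[12] / mean(food) - 1))  # ~19% above Food's average month
round(100 * (dept[12] / mean(dept) - 1))  # ~85% above Department stores' average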

2. The further things are from the origin, the more discriminating they are

The correspondence analysis plot below is from a big table consisting of 42 rows, each representing a different brand, and 15 columns. You can see the original data here. Correspondence analysis has greatly simplified the story in the data. As you hopefully remember from school, the origin is where the x- and y-axes are both at 0. It is shown below as the intersection of two dashed lines. The further labels are from the origin, the more discriminating they are. Thus, Lee Jeans (at the top) is highly differentiated. Similarly, Outdoorsy is a highly discriminating attribute.
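If you have the map coordinates, discrimination is simply distance from the origin. A minimal sketch, with hypothetical coordinates rather than the ones behind this plot:

# Distance from the origin for each label (hypothetical coordinates)
coords <- rbind("Lee Jeans" = c(0.10, 0.90),   # far from the origin
                "Qantas"    = c(0.02, -0.01))  # near the origin
sort(sqrt(rowSums(coords^2)), decreasing = TRUE)  # most discriminating first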


3. The closer things are to the origin, the less distinct they probably are

In the map above, we see that Qantas is bang smack in the middle of the visualization. Thus, the conclusion probably is that it is not differentiated based on any of the data in the study. I explain the use of the weasel-word "probably" in the next section.

Here is another example. In the center of the map we have Wallaby and Lucky. Does this mean wallabies are lucky animals? No. They get hit by cars a lot. If you follow rugby, you will know that 99 times out of 100 a Wallaby is no match for even a Kiwi. If you look at the table below, you can see that the Wallaby is pretty average on all the variables being measured. As it has nothing that differentiates it, the result is that it is in the middle of the map (i.e., near the origin). Similarly, Lucky does not differentiate, so it is also near the center. That they are both in the center tells us that they are both indistinct, and that is all that they have in common (in the data).



4. The more variance explained, the fewer insights will be missed

I have reproduced the correspondence analysis of the brand personality data below. You will hopefully recall my mentioning that Qantas being in the middle meant that it was probably not differentiated based on the data. Why did I write "probably"? If you sum the proportions of variance explained by the horizontal and vertical dimensions (shown in the axis labels), you will see that the visualization displays 57% of the variance in the data. And, remember, this is only 57% of the variance in the relativities. So, a lot of the data has been left out of the summary. Perhaps Qantas is highly differentiated on some dimension that is irrelevant for most of the brands; the only way to know for sure is to check the data.


Now, in fairness to correspondence analysis, it is important to appreciate that it is actually a great achievement for the map to explain 57% of the variation with such a large input table. To represent all of the relativities of this table requires 14 dimensions, but we have plotted only two. Correspondence analysis is not the problem. The problem is the quantity of the data. The more data, the greater the chance that any good summary will miss important details.

5. Proximity between row labels probably indicates similarity (if properly normalized)

As discussed in some detail in How correspondence analysis works (a simple explanation), we should be able to gauge the similarity of row labels based on their distance on the map (i.e., their proximity).  "Should" is another weasel word! Why? Three things are required in order for this to be true:

  1. We need to be explaining a high proportion of variance in the data. If we are not, there is always the risk that the two row labels are highly distinct, but are still shown on the map as if not distinct.
  2. The normalization, which is a technical option in correspondence analysis software, needs to have been set to either principal or row principal. I return to this in the next section.
  3. The aspect ratio of the map needs to be fixed at 1. That is, a unit of distance on the horizontal axis must match a unit on the vertical axis. If your maps are in Excel or, as in the example below, PowerPoint, you may well have a problem. In the chart below, the really big pattern is that there is an enormous gap between the pro-Trump camp, on the far left, and the pro-Clinton camp on the far right. If you have even a passing understanding of American politics, this will make sense. However, if you look at the scale of the labels on the x- and y-axes you will see a problem: a distance of 0.2 on the horizontal is equivalent to a distance of 0.6 on the vertical. The map below this has the aspect ratio set to 1, and it tells a different story. Yes, the pro- and anti-Trump camps are well apart, but the disenfranchised youth are now much more prominent. (A one-line fix for maps drawn in R is sketched after this list.)
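If you draw the map yourself in R rather than PowerPoint, fixing the aspect ratio is a one-argument job. A sketch, where coords is a stand-in matrix of map coordinates:

# Force one unit on the x-axis to equal one unit on the y-axis
coords <- cbind(x = rnorm(10), y = rnorm(10))  # hypothetical map points
plot(coords, asp = 1)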

 


6. Proximity between column labels indicates similarity (if properly normalized)

This is a repeat of the previous point, but applying to columns. Here, the normalization needs to be either principal or column principal. You may recall me writing in the previous point that to compare between rows, we need to be using either principal or row principal normalization. So, setting the normalization to principal seems the obvious solution. But, before jumping to this conclusion, which has its own problems (as discussed in the next section), I will illustrate what these different normalization settings look like. The visualization below is based on the principal normalization. Principal is the default in some apps, such as Displayr, Q, and the R package flipDimensionReduction. However, it is not the default in SPSS, which means that comparing the distances between row labels in a map created by SPSS with defaults is dangerous.


The plot below uses the column principal normalization. If you look very carefully, you will see that the positions of the column points are unchanged (although the map has been zoomed out). But, the positions of the row labels, representing the brands, have changed in two ways. First, they have been stretched out to be further from the origin. Second, the degree of stretching has been greater vertically. With the principal plot shown above, the horizontal differences for the row labels are, in relative terms, bigger. With the column principal plot shown below, the vertical differences are bigger. So, to repeat the point made a bit earlier: the distances between the column points are valid for both principal and column principal, but the distances between the row points are not correct in the column principal plot shown below.


The visualization below shows the row principal normalization. Now the distances between the row labels are meaningful and consistent with those shown in the principal normalization, but the differences between the column coordinates are now misleading.
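If you want to experiment with these settings outside Displayr, Q, or SPSS, Nenadic and Greenacre's ca package on CRAN exposes the same choices through the map argument of its plot method. A sketch with a made-up table (as far as I can tell, the package's "symmetric" map corresponds to the principal normalization used above):

library(ca)  # install.packages("ca")
tbl <- matrix(c(20,  5,  3,
                 8, 15,  6,
                 2,  4, 18),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Brand A", "Brand B", "Brand C"),
                              c("Fun", "Relax", "Kids")))
fit <- ca(tbl)
plot(fit, map = "symmetric")     # both sets in principal coordinates
plot(fit, map = "rowprincipal")  # row-to-row distances meaningful
plot(fit, map = "colprincipal")  # column-to-column distances meaningful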


7. If there is a small angle connecting a row and column label to the origin, they are probably associated

Take a look at the plot above. Would you say Lift is more strongly associated with Cheers you up or Relax? If you have said Relax, you are interpreting the map correctly. As discussed in How correspondence analysis works (a simple explanation) it is wrong to look at the distance between row labels and column labels. Instead, we should imagine a line connecting the row and column labels with the origin. The sharper the angle, the stronger the relationship. Thus, there is a strong relationship between Relax and Lift (although, if you look at the data shown below, you will see that Lift is very small, so it does not in any sense "own" Relax).


If you have not yet had your coffee for the day, go get it now. We are at the hard bit. In the plot above, the angles are informative. However, interpreting the angles is only strictly valid when you have either row principal, column principal, or symmetrical (1/2) normalization. So, if we want to make inferences about the relationships between the rows and columns (e.g., brands and attributes in the examples above), we are better off not using the default principal normalization. This is really the yuckiest aspect of correspondence analysis. No one normalization is appropriate for everything. Or, stated from a glass-half-full perspective, our choice of normalization is really a choice of how we want to mislead the viewer!

Additional complexity is added to this problem by the tendency of people not to report the normalization. Fortunately, we can make an educated guess based on the dispersion of the points (if the row points are all near the origin, we probably have row principal, and vice versa for columns).

Depending on the day of the week, I have two ways of dealing with this issue. Most of the time, my preference is to use the principal normalization and remind viewers to check everything in the raw data. Sometimes, though, where my interest is principally in the rows of a table, I use row principal normalization and a moonplot. The brands are plotted inside a circle, and the distances between them are meaningful. The column labels are shown on the outside of the circle. They have the same angles as on the plot above, but now the font size represents what was previously indicated by the distance between the column labels and the origin. The beauty of this representation is that, because every column label sits at the same distance from the origin, the reader cannot misread the distances between column labels and row points, so the plot is much harder to misinterpret, and we have no need to educate the reader about reading angles. The information regarding the relativities of the column labels is harder to gauge but, arguably, this is beneficial, as the construction of the plot makes it clear that the focus is on the rows (brands).


 

8. A row and column label are probably not associated if their angle to the origin is 90 degrees

In the moonplot above, if you draw a line connecting Red Bull to the origin, and back out to Kids, you will see that it forms roughly a right angle (90 degrees). This tells us that there is no association between Kids and Red Bull. Again, I have written "probably". If you look at the data, shown in the table above, there is clearly a negative association. Remember, always look at the data!

9. A row and column label are probably negatively associated if they are on opposite sides of the origin

The plot below shows the traits that people want in an American president, by age. What do the 25 to 34 year olds yearn for? There is a strong association with Entertaining. What is the next strongest association? You may think it would be concern about global warming and minorities. This is not the case. The next strongest associations are negative ones: the 25 to 34 year olds are less keen on a Christian president, one who has been successful in business, and one who is plain-speaking. We can see this because these traits are on the opposite side of the origin, and a long way from it, whereas the traits relating to global warming and the welfare of minorities are all closer to the origin, and thus less discriminating.

Here's another example. The correct way to read this visualization is that Yahoo is, in relative terms, stronger than Google on Fun. However, if you look at the raw data, it shows that Google is much more fun than Yahoo (54% versus 28%). The reason that Yahoo has the stronger association with Fun is that Fun is its second-best-performing attribute (just behind the 29% for Easy-to-use). By contrast, while Google is twice as fun as Yahoo, it scores three times as high on High quality and High performance, which are on the opposite side of the map, and this is what drags Google away from Yahoo.



10. The further a point is from the origin, the stronger its positive or negative associations

The visualization below shows the movement of Yahoo's perceptions from 2012 to 2017, with the arrowhead showing 2017 and the base of the arrow showing 2012. The obvious way to read this is that Yahoo has become more fun, more innovative, and easier to use. However, such a conclusion would be misplaced.

A better interpretation is:

  • In 2012, the angle formed by connecting the base of Yahoo to the origin and back to Fun is very small, which tells us that they are associated.
  • As Fun is relatively far from the origin we know that Fun is a relatively good discriminator between the brands.
  • As Yahoo was very far from the origin, and associated with Fun, we can conclude that Yahoo and Fun were closely associated in 2012 (remember, correspondence analysis focuses on relativities; in 2012 Yahoo's Fun score was half of Google's).
  • From 2012 to 2017, Yahoo moved much closer to the origin, which tells us that Yahoo's relative strengths in terms of Fun, Easy-to-Use, and Innovative have likely declined (and, in reality, they have declined sharply; Google is now more than four times as fun).


Summary

It is really, really important to always check key conclusions from correspondence analysis by inspecting the raw data.

Software

Hopefully you like the look of the plots in this post! They can all be created in R using the Displayr/flipDimensionReduction package, or in Displayr and Q via the menus. More detail about the various plots shown in this post, and R code, can be found in the other correspondence analysis posts on this blog.


Easily Add Images to a Correspondence Analysis Map in Displayr
https://www.displayr.com/easily-add-images-to-a-correspondence-analysis-map-in-displayr/

You can take your correspondence analysis plots to the next level by including images. Better still, you don’t need to paste in the images after the analysis is complete – you can include them right from the start.

The visualization above shows the results of a correspondence analysis based on data from a study of how people perceive different carbonated soft drinks. Logos, taken from image files, are shown instead of brand names, with lines and dots indicating the precise locations of the brands.


Creating such visualizations in Displayr is straightforward:

  1. Create logos that are of similar sizes, and put them somewhere on the web. The most straightforward way to do this is to use Dropbox and share the files. Make sure that you check this has worked by pasting the URLs into your browser; if you do not see the logo, something has gone wrong.
  2. Insert > Dimension Reduction > Correspondence Analysis of a Table.
  3. Select the data you wish to analyze in the Object Inspector > Inputs > Correspondence Analysis of a Table > Input table(s) field. If you want to enter the data manually instead, click Paste or type table, paste the data in, and click OK.
  4. In the Object Inspector > Inputs > Correspondence Analysis of a Table, check Use logos for rows. If you want to use logos for the data in the columns instead, check Switch rows and columns.
  5. Paste your URLs, with commas between them and quotation marks around them, into the Logos box. The order of the images in the list should match the order of the rows in the table. The list of URLs that I used in the example above looks like this:
 
http://docs.displayr.com/images/9/90/Coke.png,
http://docs.displayr.com/images/7/7c/V.jpg,
http://docs.displayr.com/images/8/82/RedBull.jpg,
http://docs.displayr.com/images/d/dc/LifePlus.jpg,
http://docs.displayr.com/images/0/09/DietCoke.png,
http://docs.displayr.com/images/d/da/Fanta.jpg,
http://docs.displayr.com/images/e/e8/Lift.png,
http://docs.displayr.com/images/5/5e/Pepsi.jpg

The resulting visualization, shown below, is interactive. You can manually move the logos and the labels around.


