Principal Component Analysis - Displayr

Working with Principal Components Analysis Results

Principal Components Analysis (PCA) is a technique for taking many variables and creating a new, smaller set of variables. These aim to capture as much of the variation in the data as possible. In this post, we show you how to save, access, and export the PCA results and output. For information on how to set up and run the PCA, see How to Do Principal Components Analysis in Displayr.

Principal Component Loadings

The default PCA output is the Principal Components Loadings table, which shows one row for each of the original variables. Continuing the example from How to Do Principal Components Analysis in Displayr, each of the 8 new variables, or components, identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the original variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. To display these values, deselect the Suppress small coefficients checkbox.

PCA Component Loadings Table
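To make the idea concrete, here is a minimal sketch of the same calculation in plain R, assuming a numeric data frame called dat (a hypothetical name, not part of the Displayr example): the loadings are simply the correlations between the original variables and the component scores.

[sourcecode language="r"]
# Sketch: loadings as correlations between original variables and component scores
pca <- prcomp(dat, scale. = TRUE)    # PCA on standardized variables
scores <- pca$x[, 1:8]               # keep the first 8 components
loadings <- cor(dat, scores)         # one row per original variable, one column per component
round(loadings, 2)
[/sourcecode]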

Saving Component Scores

To save a set of respondent level component score variables from the PCA output, select:

Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions

This creates a set of variables, one for each component, at the top of the Data Sets tree, grouped together as a question called Scores from dim.reduce. These scores are standardized respondent-level component scores, with a mean of 0 and a standard deviation of 1 across the entire sample. You can then rename the component variables based on the attributes with which they most closely correlate. To do this, select each of the component variables grouped under Scores from dim.reduce in the Data Sets tree, right-click, and select Rename.

The new variables are linked back to your PCA output. This means that if you change any of the input options and then calculate the PCA again, the scores will also update automatically based on the updated analysis. If you change the number of components in the analysis, you should delete the variables for the scores in the Data Sets tree and save a new set of scores.

As an alternative, you can also save the component score variables as follows:

1. From the Insert menu, select R > Numeric Variable
2. In the R CODE field, paste in the following code (where dim.reduce is the name of the PCA output that you've previously created):

[sourcecode language="r"]
fitted(dim.reduce)
[/sourcecode]

3. Click the Calculate button to run the code.
4. Allocate a Question Name and Label in GENERAL.
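The scores returned by fitted() should come back standardized. As a quick sanity check, you can run the following in another R calculation (a sketch that again assumes the output is named dim.reduce):

[sourcecode language="r"]
scores <- fitted(dim.reduce)         # respondent-level component scores
round(colMeans(scores), 2)           # each column should be approximately 0
round(apply(scores, 2, sd), 2)       # each column should be approximately 1
[/sourcecode]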

Exporting PCA Results

To export the Rotated Loadings table, select the PCA output and then from the menu select Export > Excel. Select Current Selection and then click the Export button. An Excel file containing the loadings table will be exported.

You can also generate an R output of the loadings table by selecting Insert > R Output (in the Analysis group) from the menus, then enter the following R code and click the Calculate button.

[sourcecode language="r"]
dim.reduce$rotated.loadings
[/sourcecode]

This will generate an unsorted R table containing the loading coefficients, which can also be exported to Excel. You can adjust the number of decimal places using the decimal options on the Appearance menu. Note that this code assumes a PCA named dim.reduce, which is the default PCA object name in Displayr. If you've renamed your PCA analysis, you'll need to make the same change in the code.
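Because the R table is unsorted, you may find it easier to read after ordering the rows, for example by the absolute size of their loading on the first component (a sketch, again assuming the default dim.reduce name):

[sourcecode language="r"]
loadings <- dim.reduce$rotated.loadings
loadings[order(-abs(loadings[, 1])), ]   # strongest loadings on Component 1 first
[/sourcecode]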

If you instead want to export the respondent-level component scores, you can do so by creating a raw data table and then exporting it to Excel. To do this, from the menu select Insert > More > Tables > Raw Data. Next, select each of the component scores from the Variables drop-down list in the Object Inspector. Click the Calculate button to generate the output. This output can now be exported by selecting an option from the Export menu.
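For completeness, in an ordinary R session where the same object is available, writing the scores to a file is a one-liner (a sketch assuming the dim.reduce name):

[sourcecode language="r"]
write.csv(fitted(dim.reduce), "component_scores.csv", row.names = FALSE)
[/sourcecode]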

Learn More about Dimension Reduction in Displayr

Correspondence Analysis


Webinar: DIY Market Mapping Using Correspondence Analysis

Ebook: DIY Correspondence Analysis

How Correspondence Analysis Works (A Simple Explanation)

Understanding the Math of Correspondence Analysis

How to Interpret Correspondence Analysis Plots

Correspondence Analysis Versus Multiple Correspondence Analysis

Principal Component Analysis

Principal Component Analysis (Wiki example)

How to Do Principal Components Analysis in Displayr

The Basic Mechanics of Principal Components Analysis

Principal Component Analysis of Text Data

Varimax Rotation

Component Score Coefficient Matrix

Kaiser Rule

Determining the Number of Components in Principal Components Analysis

Validating Principal Components Analysis

Common Misinterpretations of Principal Components Analysis

Text Analysis - Advanced - Principal Components Analysis (Text)

Saved Principal Components Analysis Variables

 

Multidimensional Scaling and t-SNE

What is Multidimensional Scaling (MDS)?

t-SNE

How t-SNE Works

Goodness of Fit in MDS and t-SNE with Shepard Diagrams

 

How to Do Principal Components Analysis in Displayr

Data setup

Principal Components Analysis always treats data numerically. This means that you need to be careful with the question Structure assigned to your variables to ensure the analysis uses their numeric values. The variables in a PCA should be part of a Numeric, Numeric - Multi, or Binary - Multi question.

In most cases, you should set your variables up as Numeric or Numeric - Multi. The variables do not need to be grouped together. Remember, they could come from different questions, but they should all be on the same scale (that is, don't mix 5-point scales with binary variables or 10-point scales). Binary - Multi is appropriate to use when the data are binary.
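If you do need to combine variables measured on different scales in an ad-hoc R analysis (outside the Structure settings described here), a common workaround is to standardize them first so they contribute comparably. A sketch, using a hypothetical numeric data frame dat:

[sourcecode language="r"]
dat_std <- as.data.frame(scale(dat))  # every column rescaled to mean 0, sd 1
pca <- prcomp(dat_std)                # equivalent to prcomp(dat, scale. = TRUE)
[/sourcecode]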

If your variables are not set up as Numeric, Numeric - Multi, or Binary - Multi, you can:

  1. Locate the variables in the Data Sets tree.
  2. (Optional) Make new copies of the variables by selecting them, and from the menu choosing Home > Duplicate.
  3. From the Object Inspector on the right side of the screen, change the Structure to either:
    1. Numeric, if there’s a single numeric variable,
    2. Numeric - Multi, if you have multiple numeric variables that are grouped together, or
    3. Binary - Multi, for binary variables.

In this article, I am using an example of a 5-point scale (called “Q23. Attitudes”), in which we asked respondents to rate several statements about their mobile phone use. Originally, the variables were set up as a Nominal - Multi question, which is typically how looped scales like this appear in Displayr. In my screenshot below, I made a copy of the question for use in the PCA, and then set the Structure to Numeric - Multi.

Data Sets Tree

Creating the Principal Components Analysis

To create the PCA in Displayr:

Object Inspector

  1. Select Insert > Dimension Reduction > Principal Components Analysis.
  2. In the Object Inspector on the right side of the screen, choose the variables that you want to analyze in the Variables box.
  3. Tick Automatic, which ensures the PCA will remain up to date when the data changes or when you change the settings.

The output from the PCA is what is known as a loadings table. This table shows one row for each of my original mobile phone statement variables (there are 23). Each of the 8 new variables identified by the PCA appears in the columns. The cells of the table show figures referred to as loadings.

These loadings represent the correlations between the new variables and the old variables. As correlations, they will always range between -1 and 1. A score towards 1 indicates a strong positive relationship, a score towards -1 indicates a strong negative relationship, and scores closer to 0 indicate weaker or non-existent relationships. The output omits smaller correlations. However, the bar remains to indicate their values. Change this by toggling the Suppress small coefficients box.

PCA Component Loadings Table

The table is sorted in a way that makes it easy to work out what the 8 new variables mean. The first variable (“Component 1”) shows a strong correlation with the variables for “Want to view videos”, “Want video phone”, “Want to play music”, “Like fast internet on phone”, and “Do mobile banking”. We conducted this study before the age of the smartphone. At the time, these higher-technology features were uncommon in phones.

This new variable thus represents an underlying factor of desire for better technological capabilities in phones. The second variable strongly correlates with variables that reveal a desire to stay in touch and connected. The third variable represents an attitude that phones need only make calls or have basic functionality, and so on.

The output also tells us a number of key things about the analysis:

  • The 8 components represent 57.7% of the original variance in the data. You inevitably lose some information when you reduce variables like this.
  • The first variable (“Component 1”) accounts for 12.8% of the variation, the second accounts for 8.63%, and so on. Components are sorted from the most variation explained to the least (a sketch of this calculation follows the list).
  • The footer contains additional information about the sample size and the settings used.
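The variance figures come straight from the eigenvalues. A rough sketch of the calculation with base R's prcomp (not Displayr's exact computation, and using a hypothetical data frame dat):

[sourcecode language="r"]
pca <- prcomp(dat, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)    # proportion of variance per component
round(100 * var_explained[1:8], 1)               # Component 1, Component 2, ...
round(100 * sum(var_explained[1:8]), 1)          # total % captured by an 8-component solution
[/sourcecode]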

In the next few sections, I’ll explain some settings that we didn’t change, and how to save the new variables to your data set so you can use them elsewhere.

Determining the number of components

In the analysis above, the PCA automatically generated 8 variables. It did this using a heuristic known as the Kaiser rule, an option in the Rule for selecting components drop-down menu. This is a commonly used rule (a short sketch of it appears after the list below), but you can also choose between two other methods:

  • Number of components. Choose this option if you want to specify the number of components to keep yourself.
  • Eigenvalues over. Eigenvalues are numbers associated with each component, and they are listed at the top of each column. This setting lets you specify the eigenvalue cut-off that a component must exceed to be kept.
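For reference, the Kaiser rule itself is easy to sketch in base R: keep the components whose eigenvalues, computed from the correlation matrix, exceed 1. This is a simplified illustration using a hypothetical data frame dat, not Displayr's implementation:

[sourcecode language="r"]
eig <- eigen(cor(dat, use = "pairwise.complete.obs"))$values
sum(eig > 1)   # number of components the Kaiser rule retains
[/sourcecode]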

Rotations

In the analysis above, I used a technique called Varimax rotation, Displayr’s default option in the Rotation method drop-down menu. The concept of rotation can be a bit abstract to talk about without getting into the mathematics of the technique. Put simply, the PCA problem can have an infinite number of solutions which all capture the same amount of variation in the data. The rotation tries to find, among those many solutions, the one that is easiest to interpret, by making as many loadings as possible either close to zero or close to 1.

If you have a favorite rotation method to use, the Rotation method drop-down menu contains several other options. They are all described in mathematical terms, so discussing them here would not add much value if you don’t already have a preferred technique. In my experience, Varimax seems to be the most popular.
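If you want to see the mechanics, base R's varimax() can be applied to unrotated loadings. This is a simplified stand-in for what Displayr does, again using a hypothetical data frame dat:

[sourcecode language="r"]
pca <- prcomp(dat, scale. = TRUE)
raw <- pca$rotation[, 1:8] %*% diag(pca$sdev[1:8])  # unrotated loadings on the correlation scale
rotated <- varimax(raw)
round(unclass(rotated$loadings), 2)                 # rotated loadings, easier to interpret
[/sourcecode]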

Saving variables

To use the results of the PCA in another analysis you need to save the variables to your data set. To do so:

  1. Have your PCA output selected on the page.
  2. From the menu select Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions. This will add the new variable set to the top of the Data Sets tree.
  3. (Optional) Right-click on the row labels in the variable set and Rename them, to make the components more recognizable.

Now, you can create a table from the component scores. The table will be full of 0s, indicating that the average score of each component is zero. Don't be alarmed! This occurs because the variables are standardized (with a mean of zero and a standard deviation of 1), which is standard practice. If you create a crosstab with another question, the variation between the components becomes more apparent. For instance, I renamed my components and created a table with the Age groups from the study:

PCA Components by Age

Rather unsurprisingly, the younger people have higher scores on the “Want technology” and “Cost-sensitivity” components, and a much lower score on the “Only use the basics” component.
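Outside Displayr, the same kind of crosstab can be approximated in a couple of lines, assuming a matrix of saved scores and an age-group factor (both object names here are hypothetical):

[sourcecode language="r"]
# Mean component score within each age group
aggregate(as.data.frame(scores), by = list(Age = age_group), FUN = mean)
[/sourcecode]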

These new variables can be used just like any other in Displayr. Once you are happy with your new components, go back to the PCA output, and untick the Automatic box. This will prevent any changes to the components. If you modify your PCA later and change the number of components in the solution, you should delete the saved variables and run Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions again.

Hopefully, you find that Principal Components Analysis is easy to do in Displayr, and by saving the variables you can use it to complement your other analyses. Don’t forget the three main steps: set up your data correctly, create the analysis output, and use the output to save your new variables. Good luck and happy dimension reducing!

Principal Component Analysis of Text Data

Worked example: Understanding attitude towards Tom Cruise

This post analyzes text data where people have listed their reasons for not liking Tom Cruise. The raw data is shown in the table below.

One component

By default, Displayr creates a PCA with two components, but to explain the technique I'm going to start by looking at the result with a single component. With one component, the PCA of text data seeks to find a single numeric variable that best explains the differences between the text responses.

The table of loadings below shows the correlation of different words and phrases with the numeric variable that summarizes the text. The way to read it is as follows:

  • The strongest correlation is for people that have used the word nothing (or a synonym) anywhere in their text.
  • The slightly weaker correlations for Exclusive: nothing is for people who mentioned nothing, but didn't mention it as a part of a bigram (a pair of words that appear commonly together).
  • Stem: not is the correlation of the word not and any words that commence with not (e.g., not, nothing) with the numeric variable.
  • nchars is the number of characters in the text. As it has a negative correlation it means that the more somebody typed, the lower their score on the variable that has been identified.
  • The first component is negatively correlated with Negative sentiment (i.e., the more negative the sentiment of the text, the lower the score on this component, so high scores on the variable correspond to positive sentiment).

Putting all the results together tells us that if we have to summarize the text data as a single numeric variable, that variable measures whether somebody said nothing (or a synonym) at one end of the continuum, or did not say nothing at the other.

The table below shows the numeric variable that has been computed. We can see, for example, that respondent 10 said nothing and has a relatively high score (2.3). Respondent 1's answer isn't purely Nothing, which is why their score is closer to 0 (the average). By contrast, respondents who didn't write nothing have negative scores.

Two components

The table below shows the loadings from the two-component solution. The first component has essentially the same meaning as in the first analysis. But, if you scroll down, you will see that the second component is measuring whether or not somebody didn't say tom cruise (note the negative correlation). This second component captures, at one end, mentioning Tom Cruise and like, and, at the other end, not mentioning Tom Cruise and not mentioning like.

Four components

When we look at the four component solution, we end up with four variables that have the following interpretation:

  • First component variable - whether the text said nothing, or the similar variants described for the one-component solution above.
  • Second component variable - whether the text mentions like or actor.
  • Third component variable - whether the text contains Stem: scientolog (i.e., scientology or scientologist, and any misspellings beginning with scientolog). Words that are synonyms of faith are also positively correlated with this variable.
  • Fourth component variable - whether the text does not mention crazy.


The table below shows the raw values of the four variables, sorted by the fourth variable (lowest to highest). We can see that the further a value falls below zero on the fourth variable, the more likely the respondent was to say that they regarded Tom Cruise as crazy.

This analysis is useful in its own right, as a summary of the key trends in the data. And, the variables can be used as inputs into other analyses, such as cluster analysis or latent class analysis (segmentation).

Selecting the number of components

How many components should you have? This is best determined by judgment: choose the number which leads to a result that makes sense.

An alternative is a scree plot. The basic idea is that you imagine the plot is showing an arm, and you choose the number of components that occurs at around the "elbow". In this example we have a double-jointed elbow, so the plot at best tells us that 10 or fewer components is appropriate. As mentioned in the previous paragraph, my recommendation is to just use judgment.
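If you do want a scree plot for reference, one can be drawn in a couple of lines of R. This is a generic sketch using a hypothetical numeric data frame dat, not the text-specific pipeline described below:

[sourcecode language="r"]
eig <- eigen(cor(dat))$values   # eigenvalues, largest first
plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree plot")
[/sourcecode]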

One common heuristic for selecting the number of components is to use the Kaiser rule (eigenvalues > 1). Such rules aren't practical when using PCA for text data. This is because the PCA has 512 dimensions, and pretty much any traditional heuristic for determining the number of dimensions will recommend too many dimensions (e.g., with this example, the Kaiser rule suggests 81 components).

Instructions for conducting principal component analysis of text data

  • To conduct the analysis in:
    • Displayr: Insert > Text Analysis > Advanced > Principal Components Analysis (Text)
    • Q: Create > Text Analysis > Advanced > Principal Components Analysis (Text)
  • Set the text variable in the Variable field.
  • Specify the desired Number of components.
  • Press ACTIONS > Save variables to save the variables to the data file.

How it works

  • The text data is cleaned.
  • If necessary, it is translated into English.
  • It is converted into 512 numeric variables using Google's Universal Sentence Encoder for English.
  • A PCA is performed on the 512 numeric variables and the scores are extracted.
  • A term-document matrix is created from the cleaned text data, along with sentiment analysis and some related variables.
  • The loadings are computed as the cross-correlation matrix of the term-document matrix (rows) and the PCA scores (columns); see the sketch below.
  • A varimax-type rotation is applied to the loadings.
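The loading step can be sketched in a few lines of R. The names embeddings (an n x 512 matrix of sentence-encoder values) and tdm (a respondent-by-term matrix) are hypothetical placeholders, and this is a conceptual illustration rather than Displayr's internal code:

[sourcecode language="r"]
pca    <- prcomp(embeddings)     # PCA on the 512 encoder variables
scores <- pca$x[, 1:4]           # e.g., a four-component solution
loadings <- cor(tdm, scores)     # cross-correlate terms/phrases with component scores
round(loadings, 2)               # one row per term, one column per component
[/sourcecode]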
Factor Analysis and Principal Component Analysis: A Simple Explanation

What are factor analysis and principal component analysis?

Factor analysis and principal component analysis identify patterns in the correlations between variables. These patterns are used to infer the existence of underlying latent variables in the data. These latent variables are often referred to as factors, components, or dimensions.

The most well-known application of these techniques is in identifying dimensions of personality in psychology. However, they have broad application across data analysis, from finance through to astronomy. At a technical level, factor analysis and principal component analysis are different techniques, but the difference is in the detail rather than the broad interpretation of the techniques.

 


A worked example

The table below shows a correlation matrix of the correlations between viewing of TV programs in the U.K. in the 1970s. Each of the numbers in the table is a correlation, showing the relationship between viewing of the TV program in the row and viewing of the program in the column. The higher the correlation, the greater the overlap in the viewing of the two programs. For example, the table shows that people who watch World of Sport frequently are more likely to watch Professional Boxing frequently than are people who watch Today. In other words, the correlation of .5 between World of Sport and Professional Boxing is higher than the correlation of .1 between Today and Professional Boxing.

factor analysis

The table below shows the data again, but with the columns and rows re-ordered to reveal some patterns.  Looking at the top left of the re-ordered correlation matrix, we can see that the people who watch any one of the sports programs are more likely to watch one of the other sports programs.  Similarly, if we look at the bottom right of the matrix we can see that people who watch one current affairs program are more likely to watch another, and vice versa.

re-ordered correlation matrix

Where a set of variables is correlated with each other, a plausible explanation is that there is some other variable with which they are all correlated. For example, the reason that viewership of each of the sports programs is correlated with the others may be that they are all correlated with a more general variable: the propensity to watch sports programs. Similarly, the factor that explains the correlations among viewership of the current affairs programs may be that people differ in their propensity to view current affairs programs. Factor analysis is a statistical technique that attempts to uncover such factors.

The table below shows the rotated factor loadings (also known as the rotated component matrix) for the U.K. TV viewing data.  In creating this table, it has been assumed that there are two factors (i.e., latent variables).  The numbers in the table show the estimated correlation between each of the ten original variables and the two factors.  For example, the variable that measures whether or not someone watches Professional Boxing is relatively strongly correlated with the first factor (0.73) and has a slight correlation with the second factor (0.086).  The first factor seems to be the propensity to watch sports and the second seems to be the propensity to watch current affairs.

rotated factor loadings
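A comparable analysis can be run directly from a correlation matrix with base R's factanal(). Here tv_cor stands in for a 10 x 10 correlation matrix like the one above, and the sample size is a made-up placeholder:

[sourcecode language="r"]
fa <- factanal(covmat = tv_cor, factors = 2, rotation = "varimax", n.obs = 500)
print(fa$loadings, cutoff = 0.3)   # hide small loadings, as in the table above
[/sourcecode]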

When conducting factor analysis and principal component analysis, decisions need to be made about how many factors should be selected. By default, programs use a method known as the Kaiser rule. However, this rule is only a rule of thumb. It is often useful to consider alternative numbers of factors and select the solution that makes the most sense.

 


The difference between factor analysis and principal component analysis

The mathematics of factor analysis and principal component analysis (PCA) are different. Factor analysis explicitly assumes the existence of latent factors underlying the observed data. PCA instead seeks to identify new variables that are composites of the observed variables. Although the two techniques can produce different results, they are similar to the point where the leading software used for conducting factor analysis (SPSS Statistics) uses PCA as its default algorithm.
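To see the distinction in code form, both can be run on the same raw data in base R (a sketch with a hypothetical numeric data frame dat):

[sourcecode language="r"]
pc <- prcomp(dat, scale. = TRUE)   # components: weighted composites of the observed variables
fa <- factanal(dat, factors = 2)   # factor analysis: explicitly models latent factors
[/sourcecode]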

 


Acknowledgements

The correlation matrix presented in this article is from Ehrenberg, Andrew (1981), "The Problem of Numeracy," The American Statistician, 35(2): 67-71.

Now that you're more familiar with factor analysis and principal component analysis, you can create them quickly in Displayr.

