Cluster Analysis - Displayr

How to do K-means Cluster Analysis in Displayr

Choose your clustering variables

To run k-Means in Displayr, we first need to select the variables that we want to use as inputs to the segmentation, commonly called the clustering variables. In the example below, we'll use a battery of behavioral and attitudinal statements about mobile technology. The questions were asked on a 5-point agree/disagree scale. We'll use the top 2 box responses to each of the statements as the inputs to our k-Means cluster analysis.

Any other numeric variables that can potentially differentiate between respondents, and therefore help define the clusters, can also be used as clustering variables.

Check out the interactive tutorial on running k-Means

Running the k-Means Cluster Analysis

To set up the cluster analysis in Displayr, select Insert > Group/Segment > K-Means Cluster Analysis. A cluster analysis object will be added to the current page. The next step is to add the input variables to the cluster analysis. In this case, we'll select the 11 behavioral/attitudinal statements from the Variables drop-down in the Inputs section on the right. If the variables are grouped in a Variable Set, then the Variable Set may be selected instead, which is more convenient than selecting multiple variables.

Next, we select the number of clusters that we want to create. I have opted to create 3 clusters, but you can choose anything you want here. For this example, we'll leave the rest of the inputs with the default values selected. The following table of means output is generated.
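For readers working outside Displayr, roughly the same setup can be sketched in base R. This is a minimal sketch, not the Displayr implementation: the data frame name tech and the assumption that codes 4 and 5 form the top 2 box are both illustrative.

# 'tech' is assumed to be a data frame holding the 11 statements coded 1 (disagree) to 5 (agree)
top2 <- as.data.frame(lapply(tech, function(x) as.numeric(x >= 4)))  # top 2 box indicator (0/1)
set.seed(123)                       # k-means starts from a random allocation, so fix the seed
fit <- kmeans(top2, centers = 3, nstart = 20)
table(fit$cluster)                  # cluster sizes (the n shown in the column headers)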

 

Interpreting the Results

The standard table of means output shown above lists each of the clustering variables in the rows and shows the mean Top 2 Box percentage for each of the clusters.

  • The size of each cluster (n) is shown in the column header.
  • The red and blue highlights indicate whether the Top 2 Box score is higher (blue) or lower (red) than the overall mean. The colors are also scaled to provide additional differentiation (darker shades of red/blue are farther from the mean).
  • Means in bold font are significantly higher/lower than the mean score.
  • The R-Squared value shows how strongly each clustering variable is associated with the cluster assignment (the proportion of variance explained). In the example above, we can see that 4 statements have a greater impact on the segment/cluster predictions than the remaining variables do. A sketch of this computation is shown after this list.
  • The p-value shows which statement variables are significant in the model.
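The R-Squared and p-value columns can be thought of as coming from a one-way ANOVA of each clustering variable on cluster membership. A minimal sketch of that computation in R, reusing the hypothetical top2 and fit objects from the earlier sketch:

stats <- sapply(top2, function(v) {
  a <- summary(aov(v ~ factor(fit$cluster)))[[1]]        # one-way ANOVA table for this variable
  c(r.squared = a[1, "Sum Sq"] / sum(a[, "Sum Sq"]),     # share of the variable's variance explained
    p.value   = a[1, "Pr(>F)"])
})
round(t(stats), 3)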

Saving Cluster Membership

Individual respondents can be assigned to their clusters in Displayr by first selecting the k-Means Cluster Analysis output and then selecting Insert > Group/Segment > Save Variable(s) > Membership. A new categorical variable called "Segment/Cluster memberships from kmeans" is added to the top of the data set. Locate the new variable in the Data Sets tree and hover over it to preview the respondent-level membership data, or drag the variable onto the page to create a table.

This segment/cluster variable can be used for profiling against your demographic variables. Once you've identified the key differences between your clusters, try to come up with names that describe each cluster. You can then add these names to the cluster variable by selecting the variable in the Data Sets tree, clicking the Labels button in the Properties panel on the right, and entering the cluster names in the Label column. Click OK to save the cluster names.

Check out the interactive tutorial on running k-Means

How to Work Out the Number of Clusters in Cluster Analysis

Penalized fit heuristics

Provided there are no technical errors, it should always be the case that the more clusters you have, the better the clusters will fit the data. At some point, however, adding more clusters overfits the data. Penalized fit heuristics are metrics that start with a computation of fit and then penalize it based on the number of clusters.

Dozens and perhaps hundreds of penalized fit heuristics have been developed, such as the Bayesian information criterion (BIC), the gap statistic, and the elbow method (where the penalty is based on the judgment of the analyst rather than a cut-and-dried rule).

A practical challenge with all penalized fit heuristics is that they tend to be optimized to work well for a very specific problem but work poorly in other contexts. As a result, such heuristics are not in widespread use.
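As an illustration, the elbow heuristic can be sketched in a few lines of R. Here dat is assumed to be a numeric data frame of clustering variables, and the range of k values is arbitrary:

wss <- sapply(2:10, function(k) kmeans(dat, centers = k, nstart = 20)$tot.withinss)
plot(2:10, wss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Total within-cluster sum of squares")
# The analyst looks for the 'elbow': the point where adding clusters stops improving fit much.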

Statistical tests

Statistical tests, such as likelihood ratio tests, can also be used to compare solutions with different numbers of clusters. In practice, these tests make very strong and difficult-to-justify assumptions, and none of them has ever been widely adopted.

The extent of association with other data

This approach involves assessing the extent to which each cluster solution (i.e., the two-cluster solution, the three-cluster solution, etc.) is associated with other data. The basic idea is that the stronger the association with other data, the greater the likelihood that the solution is valid, rather than just reflecting noise.

A practical challenge with this approach is that any truly novel and interesting finding is one that does not relate strongly to existing classifications.

Replicability

Replicability is computed by either randomly sampling with replacement (bootstrap replication) or splitting the sample into two groups. Cluster analysis is then conducted in each of the replication samples. The number of classes that gets the most consistent results (i.e., consistent between the samples) is considered to be the best. This approach can also be viewed as a form of cross-validation.

Two challenges with this approach are that local optima may be more replicable than global optima (i.e., it may be easier to replicate a poor solution than a better one), and that, all else being equal, replicability declines as the number of clusters increases.
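A split-half version of this idea can be sketched in R. The sketch below assumes dat is a numeric data frame and k is the candidate number of clusters; because cluster labels are arbitrary, agreement has to be read off the cross-tabulation (in a replicable solution each row should be dominated by a single column):

set.seed(1)
half <- sample(c(TRUE, FALSE), nrow(dat), replace = TRUE)       # random split into two halves
fit1 <- kmeans(dat[half, ],  centers = k, nstart = 20)
fit2 <- kmeans(dat[!half, ], centers = k, nstart = 20)
nearest <- function(x, centers)                                 # assign rows of x to the closest center
  apply(x, 1, function(r) which.min(colSums((t(centers) - r)^2)))
table(nearest(dat, fit1$centers), nearest(dat, fit2$centers))   # agreement between the two solutions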

No small classes

The basic idea of this approach is that you choose the highest number of classes, such that none of the classes are small (e.g., less than 5% of the sample). This rule has long been used in practice as a part of the idea of domain-usefulness but has recently been discovered to also have some theoretical justification (Nasserinejad, K, van Rosmalen, J, de Kort, W, Lesaffre, E (2017) Comparison of criteria for choosing the number of classes in Bayesian finite mixture models. PloS one, 12).

A weakness of this approach is the difficulty of specifying the cutoff value.
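A sketch of this rule in R, assuming dat is a numeric data frame and using an illustrative 5% cutoff, is to take the largest number of clusters whose smallest cluster still meets the cutoff:

min.share <- sapply(2:10, function(k) {
  fit <- kmeans(dat, centers = k, nstart = 20)
  min(fit$size) / nrow(dat)                    # share of the sample in the smallest cluster
})
candidates <- (2:10)[min.share >= 0.05]
best.k <- if (length(candidates)) max(candidates) else NA   # NA if even k = 2 produces a tiny cluster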

Domain-usefulness

Perhaps the most widely used approach is to choose the solution that appears, to the analyst, to be the most interesting.

What are the Strengths and Weaknesses of Hierarchical Clustering?

If you want to do your own hierarchical clustering, use the template below - just add your data!

The strengths of hierarchical clustering are that it is easy to understand and easy to do. The weaknesses are that it rarely provides the best solution, it involves lots of arbitrary decisions, it does not work with missing data, it works poorly with mixed data types, it does not work well on very large data sets, and its main output, the dendrogram, is commonly misinterpreted. There are better alternatives, such as latent class analysis.

Easy to understand and easy to do…

There are four types of clustering algorithms in widespread use: hierarchical clustering, k-means cluster analysis, latent class analysis, and self-organizing maps. The math of hierarchical clustering is the easiest to understand. It is also relatively straightforward to program. Its main output, the dendrogram, is also the most appealing of the outputs of these algorithms.

… But rarely provides the best solution

The scatterplot below shows data simulated to be in two clusters. The simplest hierarchical cluster analysis algorithm, single-linkage, has been used to extract two clusters. One observation -- shown as a filled red circle -- has been allocated to one cluster, with the remaining 199 observations allocated to the other cluster.

It is obvious when you look at this plot that the solution is poor. It is relatively straightforward to modify the assumptions of hierarchical cluster analysis to get a better solution (e.g., changing single-linkage to complete-linkage). However, in real-world applications the data is typically in high dimensions and cannot be visualized on a plot like this, which means that poor solutions may be found without it being obvious that they are poor.

Strengths and Weaknesses of Hierarchical Clustering
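The comparison can be reproduced on simulated data with base R's hclust. This is an illustrative sketch, not the data behind the plot above:

set.seed(2)
sim <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),     # 100 points around (0, 0)
             matrix(rnorm(200, mean = 4), ncol = 2))     # 100 points around (4, 4)
d <- dist(sim)                                           # Euclidean distance matrix
single   <- cutree(hclust(d, method = "single"),   k = 2)
complete <- cutree(hclust(d, method = "complete"), k = 2)
table(single); table(complete)   # single-linkage may isolate one or two stray points, as above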

Arbitrary decisions

When using hierarchical clustering it is necessary to specify both the distance metric and the linkage criteria. There is rarely any strong theoretical basis for such decisions. A core principle of science is that findings are not the result of arbitrary decisions, which makes the technique of dubious relevance in modern research.

Missing data

Most hierarchical clustering software does not work when values are missing in the data.

Data types

With many types of data, it is difficult to determine how to compute a distance matrix. There is no straightforward formula that can compute a distance where the variables are both numeric and qualitative. For example, how can one compute the distance between a 45-year-old man, a 10-year-old girl, and a 46-year-old woman? Formulas have been developed, but they involve arbitrary decisions.
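One widely cited formula of this kind is Gower's distance, which range-scales numeric variables and does simple matching on categorical ones. A minimal sketch using R's cluster package (the three people correspond to the example above; the coding is illustrative):

library(cluster)
people <- data.frame(age = c(45, 10, 46),
                     sex = factor(c("male", "female", "female")))
daisy(people, metric = "gower")   # pairwise dissimilarities between the three people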

Misinterpretation of the dendrogram

Dendrograms are provided as an output to hierarchical clustering. Many users believe that such dendrograms can be used to select the number of clusters. However, this is true only when the ultrametric tree inequality holds, which is rarely, if ever, the case in practice.

There are better alternatives

More modern techniques, such as latent class analysis, address all the issues with hierarchical cluster analysis.

How to Deal with Missing Values in Cluster Analysis

Most of the widely used cluster analysis algorithms can be highly misleading, or can simply fail, when most or all of the observations have some missing values. There are five main approaches to dealing with missing values in cluster analysis: using algorithms specifically designed for missing values, imputation, treating the data as categorical, forming clusters based on complete cases and then allocating partial data to clusters, and forming clusters using only the complete data.

The different approaches have been ordered in terms of how safe they are. The safest techniques are introduced first.

Cluster analysis techniques designed for missing data

With very few exceptions, most of the cluster analysis techniques designed explicitly to deal with missing data are called latent class analysis rather than cluster analysis. There are some technical differences between the two techniques, but ultimately, latent class analysis is just an improved version of cluster analysis, where one of the improvements is the way it deals with missing data.

Impute missing values

Imputation refers to tools for predicting the values that would have existed had the data not been missing. Provided that you use a sensible approach to imputing missing values (and replacing missing values with the variable's average is not a sensible approach), running cluster or latent class analysis on the imputed data set means that the missing data is treated in a better way than occurs by default when using cluster analysis. By default, most cluster analysis methods assume that data is missing completely at random (MCAR), which is both a strong assumption and one that is rarely correct; using imputation implicitly makes the more relaxed assumption that data is missing at random (MAR), which is better.

Use techniques developed for categorical data

Cluster and latent class techniques have been developed for modeling categorical data. When the data contains missing values, if the variables are treated as categorical and the missing values are added to the data as another category, then these cluster analysis techniques developed for categorical data can be used.

At a theoretical level, the benefit of this approach is that it makes the fewest assumptions about missing data. However, the cost of this approach is that the resulting clusters are often largely driven by differences in missing value patterns, which is rarely desirable.

Form clusters based on complete cases, and then allocate partial cases to segments

A popular approach to clustering with missing values is to cluster only observations with complete cases, and then assign the observations with incomplete data to the most similar segment based on the data available. For example, this approach is used in SPSS with the setting of Options > Missing Values > Exclude case pairwise.

A practical problem with this approach is that if the observations with missing values are different in important ways from those with no missing values, this is not going to be discovered. That is, this method assumes that all the key differences of interest are evident in the data where there are no missing values.

Form clusters based only on complete cases

The last approach is to ignore the data that has missing values, and perform the analysis only on observations with complete data. This does not work at all if you have missing values for all cases.  Where the sample size with complete data is small, the technique is inherently unreliable. Where the sample size gets larger, the approach is still biased except where the people with missing data are identical to the observations with complete data, except for the “missingness” of the data. That is, this approach involves making the strong MCAR assumption.

 

What is Cluster Analysis?

Cluster analysis refers to algorithms that group similar objects into groups called clusters. The endpoint of cluster analysis is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other. For example, in the scatterplot below, two clusters are shown, one by filled circles and one by unfilled circles.

Cluster Analysis

The required data for cluster analysis

Typically, cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent the quantitative characteristics of the objects. These quantitative characteristics are called clustering variables. For example, in the table below there are 18 objects, and there are two clustering variables, x and y. Cluster analysis can also be performed using data in a distance matrix.

The required data for cluster analysis

Why is cluster analysis used?

In the example above, it is easy to detect the existence of the clusters visually because the plot shows only two dimensions of data. Typically, however, cluster analysis is performed on high-dimensional data (e.g., 30 variables), where there is no good way to visualize all of the data.

The outputs from k-means cluster analysis

The main output from cluster analysis is a table showing the mean values of each cluster on the clustering variables. The table of means for the data examined in this article is shown below.

The outputs from k-means cluster analysis

A second output shows which object has been classified into which cluster, as shown below. Other outputs include plots and diagnostics designed to assess how much variation exists within and between clusters.

Cluster analysis algorithms

Cluster analysis is a computationally hard problem. For most real-world problems, computers are not able to examine all the possible ways in which objects can be grouped into clusters. Thousands of algorithms have been developed that attempt to provide approximate solutions to the problem. The three main ones are:

  • Hierarchical clustering. This technique starts by treating each object as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar clusters. This continues until all the clusters are merged together.

Try your own Hierarchical Cluster Analysis

  • k-means cluster analysis. This technique requires the user to specify a required number of clusters. Initially, observations are allocated to clusters using some arbitrary process (e.g., randomly). Then, the cluster means are computed, and objects are allocated to the closest cluster. These last two steps are repeated until the clusters do not change.
  • Latent class analysis. In terms of process, this is like k-means, except that it can be used with both numeric and non-numeric data.

Try your own Latent Class Analysis

What is Hierarchical Clustering?

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

If you want to do your own hierarchical cluster analysis, use the template below - just add your data!

Required data

Hierarchical clustering can be performed with either a distance matrix or raw data. When raw data is provided, the software will automatically compute a distance matrix in the background. The distance matrix below shows the distance between six objects.

Hierarchical Clustering
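When starting from raw data instead, the distance matrix can be computed directly. A minimal sketch in R with six illustrative objects (the coordinates are made up, not the values behind the matrix above):

raw <- data.frame(x = c(2, 3, 1, 8, 9, 10),
                  y = c(1, 2, 9, 3, 2, 1),
                  row.names = c("A", "B", "C", "D", "E", "F"))
d <- dist(raw)    # Euclidean distances between every pair of objects
d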

 

Create your own hierarchical cluster analysis

 

How hierarchical clustering works

Hierarchical clustering starts by treating each observation as a separate cluster. Then, it repeatedly executes the following two steps: (1) identify the two clusters that are closest together, and (2) merge the two most similar clusters. This iterative process continues until all the clusters are merged together. This is illustrated in the diagrams below.

Hierarchical cluster analysis

The main output of Hierarchical Clustering is a dendrogram, which shows the hierarchical relationship between the clusters:

hierarchical clustering and dendrograms

 


 

Measures of distance (similarity)

In the example above, the distance between two clusters has been computed based on the length of the straight line drawn from one cluster to another. This is commonly referred to as the Euclidean distance. Many other distance metrics have been developed.

The choice of distance metric should be made based on theoretical concerns from the domain of study. That is, a distance metric needs to define similarity in a way that is sensible for the field of study. For example, if clustering crime sites in a city, city block distance may be appropriate. Or, better yet, the time taken to travel between each location. Where there is no theoretical justification for an alternative, Euclidean distance should generally be preferred, as it is usually the appropriate measure of distance in the physical world.
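In R, the metric is simply an argument to dist(); reusing the illustrative raw data frame from the sketch above:

dist(raw, method = "euclidean")   # straight-line distance (the default)
dist(raw, method = "manhattan")   # city-block distance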

 


 

Linkage Criteria

After selecting a distance metric, it is necessary to determine how the distance between clusters is computed. For example, it can be computed between the two most similar parts of the clusters (single-linkage), the two least similar parts (complete-linkage), the centers of the clusters (mean or average-linkage), or using some other criterion. Many linkage criteria have been developed.

As with distance metrics, the choice of linkage criteria should be made based on theoretical considerations from the domain of application. A key theoretical issue is what causes variation. For example, in archeology, we expect variation to occur through innovation and natural resources, so working out if two groups of artifacts are similar may make sense based on identifying the most similar members of the cluster.

Where there are no clear theoretical justifications for the choice of linkage criteria, Ward’s method is the sensible default. This method works out which observations to group based on reducing the sum of squared distances of each observation from the average observation in a cluster. This is often appropriate as this concept of distance matches the standard assumptions of how to compute differences between groups in statistics (e.g., ANOVA, MANOVA).
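A minimal sketch of Ward's method in R, reusing the illustrative distance matrix d from the earlier sketch (in hclust, the "ward.D2" method implements Ward's criterion when given ordinary Euclidean distances):

hc <- hclust(d, method = "ward.D2")
plot(hc)    # the resulting dendrogram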

 


 

Agglomerative versus divisive algorithms

Hierarchical clustering typically works by sequentially merging similar clusters, as shown above. This is known as agglomerative hierarchical clustering. In theory, it can also be done by initially grouping all the observations into one cluster, and then successively splitting these clusters. This is known as divisive hierarchical clustering. Divisive clustering is rarely done in practice.

You can quickly create your own hierarchical cluster analysis in Displayr. Sign up below to get started.

Sign up for free

What is k-Means Cluster Analysis?

Download your free DIY Market Segmentation ebook

The required data for k-means cluster analysis

k-means cluster analysis is performed on a table of raw data, where each row represents an object and the columns represent quantitative characteristics of the objects. These quantitative characteristics are called clustering variables. For example, in the table below there are 18 objects, and there are two clustering variables, x, and y. In a real-world application, there will typically be many more objects and more variables. For example, in market segmentation, where k-means is used to find groups of consumers with similar needs, each object is a person and each variable is commonly a rating of how important various things are to consumers (e.g., quality, price, customer service, convenience).

The required data for k-means cluster analysis

How k-means cluster analysis works

Step 1: Specify the number of clusters (k). The first step in k-means is to specify the number of clusters, which is referred to as k. Traditionally researchers will conduct k-means multiple times, exploring different numbers of clusters (e.g., from 2 through 10).

Step 2: Allocate objects to clusters. The most straightforward approach is to randomly assign objects to clusters, but there are many other approaches (e.g., using hierarchical clustering). The 18 objects have been represented by dots on a scatterplot, as seen in the diagram below, where x is shown by the horizontal position of each object and y by the vertical. The objects have been randomly assigned to the two clusters (k = 2), where one cluster is shown with filled dots and the other with unfilled dots.

How k-means cluster analysis works

Step 3: Compute cluster means. For each cluster, the average value is computed for each of the variables. In the plot below, the average value of the filled dots for the variable represented by the horizontal position (x) of the dots is around 15; for the variable on the vertical dimension (y), it is around 12. These two means are represented by the filled cross. Or, stated slightly differently: the filled cross is in the middle of the black dots. Similarly, the white cross is in the middle of the white dots. These crosses are variously referred to as cluster centers, cluster means, or cluster centroids.

Compute cluster means

Step 4: Allocate each observation to the closest cluster center. In the plot above, some of the filled dots are closer to the white cross and some of the white dots are closer to the black cross. When we reallocate the observations to the closest clusters we get the plot below.

Allocate each observation to the closest cluster center

Step 5: Repeat steps 3 and 4 until the solution converges. Looking at the plot above, we can see that the crosses (the cluster means) are no longer accurate. The following plot shows that they have been recomputed using step 3. In this example, the cluster analysis has converged (i.e., reallocating observations and updating means cannot improve the solution). When you have more data, more iterations are typically required (i.e., steps 3 and 4 are repeated until no respondents change clusters).

The algorithm described above is known as the batch algorithm. Many other variants of k-means have been developed. Perhaps the most popular of these moves objects to a cluster one at a time, updating the mean each time.
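The batch algorithm described in steps 2 through 5 can be written out in a few lines of R. This is a teaching sketch only (it does not handle empty clusters, for example); dat is assumed to be a numeric data frame and k the chosen number of clusters. In practice you would simply call kmeans(dat, centers = k).

X <- as.matrix(dat)                                   # numeric clustering variables
set.seed(3)
cluster <- sample(1:k, nrow(X), replace = TRUE)       # Step 2: random initial allocation
repeat {
  centers <- apply(X, 2, function(v) tapply(v, cluster, mean))      # Step 3: k x p matrix of cluster means
  d2 <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))   # squared distance to each center
  new.cluster <- max.col(-d2)                         # Step 4: allocate each object to the closest center
  if (identical(new.cluster, cluster)) break          # Step 5: stop once nothing changes (convergence)
  cluster <- new.cluster
}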

Download your free DIY Market Segmentation ebook

The outputs from k-means cluster analysis

The main output from k-means cluster analysis is a table showing the mean values of each cluster on the clustering variables. The table of means produced from examining the data is shown below:

The outputs from k-means cluster analysis

A second output shows which object has been classified into which cluster, as shown below. Other outputs include plots and diagnostics designed to assess how much variation exists within and between clusters.

The outputs from k-means cluster analysis

Download your free DIY Market Segmentation ebook

What is a Dendrogram?

A dendrogram is a diagram that shows the hierarchical relationship between objects. It is most commonly created as an output from hierarchical clustering. The main use of a dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of six observations shown on the scatterplot to the left. (Dendrogram is often miswritten as dendogram.)

What is a Dendrogram?

 

Create your own hierarchical cluster analysis

 

To create your own dendrogram using hierarchical clustering, simply click the button above!

How to read a dendrogram

The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together. In the example above, we can see that E and F are most similar, as the height of the link that joins them together is the smallest. The next two most similar objects are A and B.

In the dendrogram above, the height of the dendrogram indicates the order in which the clusters were joined. A more informative dendrogram can be created where the heights reflect the distance between the clusters as is shown below. In this case, the dendrogram shows us that the big difference between clusters is between the cluster of A and B versus that of C, D, E, and F.

It is important to appreciate that the dendrogram is a summary of the distance matrix, and, as occurs with most summaries, information is lost. For example, the dendrogram suggests that C and D are much closer to each other than is C to B, but the original data (shown in the scatterplot), shows us that this is not true. To use some jargon, a dendrogram is only accurate when data satisfies the ultrametric tree inequality, and this is unlikely for any real-world data.

The consequence of the information loss is that the dendrograms are most accurate at the bottom, showing which items are very similar.

 


 

Allocating observations to clusters

Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line are in the same cluster. In the example below, we have two clusters: one combines A and B, and a second combines C, D, E, and F.
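In R, this corresponds to the cutree() function. A minimal sketch, assuming hc is the result of an hclust() call on the six objects' distance matrix:

plot(hc)                      # draw the dendrogram
abline(h = 5, col = "red")    # a horizontal cut line; the height of 5 is illustrative only
cutree(hc, k = 2)             # allocate the objects to two clusters (e.g., {A, B} and {C, D, E, F})
cutree(hc, h = 5)             # equivalently, cut at a chosen height instead of a chosen k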

 


 

Dendrograms cannot tell you how many clusters you should have

A common mistake people make when reading dendrograms is to assume that the shape of the dendrogram gives a clue as to how many clusters exist. In the example above, the (incorrect) interpretation is that the dendrogram shows there are two clusters, as the distance between the clusters (the vertical segments of the dendrogram) is largest between two and three clusters.

Interpretation of this kind is justified only when the ultrametric tree inequality holds, which, as mentioned above, is very rare. In general, it is a mistake to use dendrograms as a tool for determining the number of clusters in data. Where there is an obviously “correct” number of clusters this will often be evident in a dendrogram. However, dendrograms often suggest a correct number of clusters when there is no real evidence to support the conclusion.

We hope you're now an expert in dendrograms! 

To create your own dendrogram quickly in Displayr as an output of hierarchical cluster analysis, get started below.

Create your own hierarchical cluster analysis

5 Ways to Deal with Missing Data in Cluster Analysis

How to deal with missing data in cluster analysis

If you are not sure about cluster analysis or want to refresh your memory, check out our post on "What is cluster analysis" first. In this post I explain and compare the five main options for dealing with missing data when using cluster analysis:

  • Complete case analysis
  • Complete case analysis followed by nearest-neighbor assignment for partial data
  • Partial data cluster analysis
  • Replacing missing values or incomplete data with means
  • Imputation

cluster analysis and missing data

Missing data in cluster analysis example 

1,145 market research consultants were asked to rate, on a scale of 1 to 5, how important they believe statements like Length of experience/time in business and Uses sophisticated research technology/strategies are to their clients. Each consultant rated only 12 statements selected randomly from a bank of 25. Thus, each respondent has 13 missing values. Read on to discover the five ways of dealing with missing data in cluster analysis.

Complete case analysis

Performing clustering using only data that has no missing data forms the basic underlying idea of complete case analysis. In my example, no such data exists.  Because each consultant has 13 missing values, the cluster analysis fails.

Even if some complete cases are available, complete case analysis is a pretty dangerous approach. It is only valid when the cases with missing data have essentially the same characteristics as the cases with complete data. The formal statistical jargon for this is that complete case analysis assumes that the data is Missing Completely At Random (MCAR). This assumption is, problematically, virtually never true. The presence of missing data provides a clue that the cases with missing data are in some way different.

Example data with missing values

The problem with the MCAR assumption is easy to spot with a simple example. Look at the data to the right, which shows data for 8 people measured on 4 variables. How many clusters can you see? You can probably see 3. The first two rows represent the first cluster. Rows 3 and 4 represent the second cluster. Rows 5 through 8, which represent 50% of the respondents, make up the third cluster. However, if you only use the complete cases - that is, rows 1 through 4 - you can only ever find the first 2 clusters. The third cluster is entirely missing from the data used in complete case analysis.

Dealing with missing data in cluster analysis is almost a nightmare in SPSS. Returning to our case study, where we have no complete cases, if we run it using the default options in SPSS's K-means cluster we get the following error: Not enough cases to perform the cluster analysis. R's kmeans gives essentially the same message, but worded in a way that seems designed to inflict pain on the user: NA/NaN/Inf in foreign function call (arg1).
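In R, complete case analysis amounts to filtering on complete.cases() before clustering; a minimal sketch, where dat is assumed to hold the clustering variables:

complete <- dat[complete.cases(dat), ]     # keep only respondents with no missing values
if (nrow(complete) > 0) {
  fit <- kmeans(complete, centers = 3, nstart = 20)
}
# In the case study every respondent has missing values, so nrow(complete) is 0, and calling
# kmeans() on the raw data fails with the error quoted above.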

Complete case analysis followed by nearest-neighbor assignment for partial data

A common way of addressing missing values in cluster analysis is to perform the analysis based on the complete cases, and then assign observations to the closest cluster based on the available data. For example, this is done in SPSS when running K-means cluster with Options > Missing Values > Exclude case pairwise.

This is something of a fig leaf: it solves nothing, but the problem gets hidden. Looking at the simple example above, the outcome of identifying only two clusters remains. But the respondents represented by rows 5 to 8 will get assigned to one of these clusters (SPSS assigns rows 5 and 7 to the first cluster, and 6 and 8 to the second cluster).

When this method is used in our case study data, we get an error, as none of the respondents have complete data, so the cluster analysis cannot be performed.

Partial data cluster analysis

Example data with missing values

I have reproduced the simple example data set from above. If you can see the three clusters that I described earlier in my post, you have understood the essence of using partial data when clustering. The idea is as simple as this: group people together based on the data that they have in common. When we do this, we can see that the rows 5 through 8 are identical, except for the unknown missing data.

We can form clusters if we take this approach to our case study. This is a big improvement on the complete case approach. The partial data k-means algorithm that I have used here is one that I have written and made available in an R package on GitHub called flipCluster. You can use it for cluster analysis in R; however, the simplest way to use it is from the menus in Displayr (Insert > More > Segments > K-Means Cluster Analysis).
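For readers who want to see the partial-data idea in plain R without installing anything, a related (though cruder) illustration is hierarchical clustering on pairwise-available distances: base R's dist() computes each pairwise distance from the variables both rows have in common, scaled up to compensate for the missing ones. This is not the flipCluster algorithm, just a sketch of the same principle:

d  <- dist(dat)                      # 'dat' may contain NAs; distances use pairwise-complete variables
hc <- hclust(d, method = "ward.D2")
segments <- cutree(hc, k = 3)
# (This breaks down if some pair of respondents has no rated statements in common, and it does not
# scale to very large samples.)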

If you want to play around with the data in the case study in Displayr, click here.

 


Replacing missing values with means

A common hack for dealing with missing data is to replace missing values with the mean value of that variable. In the example below, there are two missing values for variable A and two for variable C. Each of these variables has an average of 8 (based on those respondents with no missing data for the variable), so we replace the missing values with values of 8.

Mean value replacement

To see the problem with mean imputation, focus on the third row of data. How similar is this respondent to those represented by the other rows? We can only compare based on variable B, as the respondent has no other data. Based on variable B, we would say that this respondent is identical to respondents 1 and 2. The respondent is also clearly different from respondents 4, 5, and 6. Thus, if clustering using partial data, we would end up with respondents 1, 2, and 3 grouped together, and respondents 4 through 6 in another segment.

However, we reach different conclusions when we look at the data where the means replace the missing values. By any sensible criterion, respondent 3 is now more similar to respondents 4, 5, and 6 than to respondents 1 and 2. So, mean imputation is fundamentally changing the structure of the underlying data. The consequence is that when means replace missing values, the final clusters we obtain are, to some extent, a consequence of the decision to replace the missing values with the means, rather than of the data itself.

Imputation

Imputation involves replacing the missing values with sensible estimates of these values. For example, looking at the example above, it may be sensible to replace respondent 2's missing value for variable A with a value of 2. This is sensible because all the information available suggests that respondent 1 and 2 are identical (i.e., they only have one variable in common, B, and both respondents have a 6 for that variable). This is the same logic that underlies using partial data cluster analysis.

It is possible to impute an even better value. The imputation can include variables not used in the cluster analysis. These other variables may be strongly correlated with variable A, allowing us to obtain a superior imputed value. Shrinkage estimators can also be used to reduce the effects of sampling error. For example, while it is true that respondents 1 and 2 are identical based on the only data available, leading to the conclusion that 2 may be a sensible value to impute, it is also true that 3 out of the 4 people who have data on variable A have a value of 10. This implies that 10 may also be a sensible value to impute. Thus, a better estimate may be one that is somewhere between 2 and 10 (or, to be even more precise, between 2 and the average of 8).

The ability to incorporate additional variables combined with using shrinkage estimators means that imputation can outperform partial data cluster analysis. This is because it can take into account relevant information that is ignored by the partial cluster analysis. However, a few things are required in order for this to occur:

  • Use a modern imputation algorithm. In practice, this usually means using one of the algorithms available in R, such as those in the mice and mi packages. Most data science and statistics apps have integrations with R (e.g., Displayr, Q, SPSS). Older imputation algorithms will generally perform worse than using partial cluster analysis. This is because the older algorithms can be quite poor (e.g., hot decking), or make theoretical assumptions that can add considerable noise to the clustering process (e.g., latent class algorithms and algorithms that assume a normal prior).
  • The imputation needs to have been done skillfully. It is easy to make mistakes when imputing data.

Consequently, if using imputation, it is usually a good idea to also use partial data cluster analysis and compare and contrast the results (e.g., checking how well they correlate with other data).
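As an illustration of the mechanics, a single imputation with the mice package followed by k-means might look like the sketch below. The settings and names are illustrative only, and mice offers many options that matter in practice:

library(mice)
imp <- mice(dat, m = 1, seed = 42)     # impute one completed version of the data
completed <- complete(imp, 1)          # extract the completed data set
fit <- kmeans(completed, centers = 3, nstart = 20)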

Multiple imputation

The best-practice approach for imputation is called multiple imputation. Because you cannot be sure which value is the best to use when imputing, in multiple imputation, you instead work out a range of possible values. You then create multiple data files with different imputed values in each. Some statistical methods can be adjusted to analyze these multiple data files (e.g., regression). However, it does not make sense to use this approach with cluster analysis, as the result would be multiple different cluster analysis solutions, and no rigorous way to combine them.

Summary

Of the five methods considered, three can be outright rejected as being inferior: complete case analysis, complete case analysis followed by nearest-neighbor assignment for partial data, and replacing missing values with means. This is not to say that these methods are always invalid. Ultimately, a cluster analysis solution needs to be judged based on its fitness for a particular problem.

It is therefore possible that these inferior methods can achieve an appropriate solution given the nature of the problem. Having said that, the nature of the inferior methods ultimately causes misrepresentation of the data. This means that, in general, using either partial data cluster analysis or imputation is advisable. If I had little time, I would always choose partial data cluster analysis over imputation: it is both simpler and safer. With sufficient time, I would generally investigate both.

The case study in this Displayr document illustrates the five methods for dealing with missing data in cluster analysis.

Acknowledgements

The data in this case study was provided by GreenBook (GRIT2012/2013).

Assigning Respondents to Clusters/Segments in New Data Files in Displayr

Once you have created segments or clusters, it is often useful to assign people in other data sets to the segments (this is also known as segment tagging and scoring). For example, you may want to tag a customer database with predicted segment memberships. Or, you may want to assign respondents in a tracker to segments. When doing this, there are two basic approaches:

  1. You can assign people to segments in the new data file using the same variables as used when forming the segments, or,
  2. You can predict segment membership based on a different set of variables.

Before proceeding with any of these approaches, it is a good idea to take a copy of your project and make your changes in the copy.

The basic principle underlying all of these approaches is that you create a model in one data set and then import a revised data set, while making sure that the model does not update to reflect the new data. You then use the existing model to make predictions in the new data set, with the new variables as inputs.

 


 

Assigning people to segments in the new data file using the same variables

The best way to do this depends on whether we have used latent class analysis (Insert > Groups/Segments (Analysis)) or k-means cluster analysis (Insert > More (Analysis) > Segment > K-Means Cluster Analysis).

 


 

Segments formed using latent class analysis

A three-segment latent class solution is shown below. This has been based on a sample size of 400. To allocate people in a new data file using these segments:

  1. Click on the data set in the Data Tree.
  2. Press Update in the Object Inspector and select the new data file. You will see some warnings. Ignore them (i.e., do not follow the suggestion about modifying the segments, as this will re-run the segments on a new data file).
  3. The Groups/Segments ... variable, which is in the Data Tree, has now automatically been updated, allocating people in the new data file to the segments.

 


 

Segments formed using k-means

 

A three-cluster k-means solution is shown above. To allocate people in a new data file using these segments:

  • Click on the k-means solution and make sure that Automatic is not checked (this option is in Inputs > R Code in the Object Inspector).
  • Take a copy of line 2 of the code. In my example, it looks like this:
 
kmeans = KMeans(data.frame(understand, shop, key, value, interested), 
  • Click on the data set in the Data Tree.
  • Press Update in the Object Inspector and select the new data file.
  • From the Ribbon, select Insert > R (Variables) > Numeric Variable.
  • In the R Code box in the Object Inspector, paste in the copied code, and modify it so that it looks like this (the key bits to retain from your pasted code are kmeans or whatever it has been changed to and the variable names):
 
predict(kmeans, newdata = data.frame(understand, shop, key, value, interested))
  • Give the variable an appropriate Name and Label.
  • Change the Structure of the variable to Mutually exclusive categories (Nominal) (this setting is found in the Object Inspector under Properties > Inputs).
  • Press Labels (below DATA VALUES) and enter any labels you desire and press OK.

 


 

Predict segment membership using a different set of variables

In this scenario, segments have been formed and then a predictive model is used to predict segment membership on either:

  • A completely different set of variables (e.g., demographics, or some other data available in a customer database).
  • A subset of the variables used to create the segments. (Tip: if you are building a predictive model based on exactly the same variables as used to create segments, you are making a mistake, and should instead use the approach described in the previous section).

 

The output above is from a multinomial logit (MNL) model (Insert > More (Analysis) > Regression > Multinomial Logit), predicting segment membership based on firmographics. The goal is now to predict segment membership in a new data file that contains the same predictor variables.

  • Click on the model output and make sure that Automatic is not checked (this option is in Inputs > R Code in the Object Inspector).
  • Take a copy of the line of code that looks similar to this (with different variable names):
 
glm = Regression(segmentsGXVYS ~ q1 + q2 + q3 + q4 + q5,
  • Click on the data set in the Data Tree.
  • Press Update in the Object Inspector and select the new data file.
  • From the Ribbon, select Insert > R (Variables) > Numeric Variable.
  • In the R Code box in the Object Inspector, paste in the copied code, and modify it so that it looks like this (the key bits to retain from your pasted code are glm or whatever it has been changed to and the variable names):
 
predict(glm, newdata = data.frame(q1, q2, q3, q4, q5))
  • Give the variable an appropriate Name and Label.
  • Change the Structure of the variable to Mutually exclusive categories (Nominal) (this setting is found in the Object Inspector under Properties > Inputs).
  • Press Labels (below DATA VALUES) and enter any labels you desire and press OK.