Machine Learning - Displayr

Identifying Drivers of Customer Satisfaction

It's one thing to know how satisfied your customers are; it's quite another to understand why. Customer satisfaction driver analysis aims to uncover the factors that influence -- or drive -- satisfaction. A customer feedback survey should ask respondents for their level of satisfaction with various features or aspects of your product or service, not just their overall satisfaction. With this information, you can identify the key drivers of customer satisfaction.

Relative importance analysis

Looking at the satisfaction scores in your survey, your first instinct may be to apply a linear regression. However, linear regression is unreliable when the predictor variables are correlated with each other, which is almost always the case in survey data. Instead, it is better to apply a modified form of regression using relative weights, which are designed to account for the correlation between predictors. Similar to Shapley regression, relative weights analysis determines what proportion of the R-squared from a linear regression model can be attributed to each independent variable.
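
If you want to run this kind of analysis in R directly, the relaimpo package's "lmg" metric is a closely related Shapley-style decomposition of R-squared. Below is a minimal sketch only: the bank data frame and the attribute variable names are hypothetical placeholders for your own data.

# Shapley-style decomposition of R-squared with the relaimpo package (sketch;
# 'bank' and the variable names are hypothetical placeholders)
library(relaimpo)
fit <- lm(Overall ~ Branch.service + Fees + Interest.rates + Phone.service, data = bank)
calc.relimp(fit, type = "lmg", rela = TRUE)  # prints each driver's share of R-squared, scaled to sum to 1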

Below, we show the relative importance output on a data set of the satisfaction of bank customers.


The results above show that the most important attribute is Branch service, which accounts for 32% of the R-squared. If instead we had used linear regression (below), we would still have identified Branch service as the most important. However, we would have mistakenly thought that Fees, Interest rates and Phone service had similar levels of importance. The relative importance analysis, by contrast, shows that Branch service is more than twice as important as Interest rates.


Tutorial: CSAT Driver Analysis in Displayr

Correlation matrix

It can be helpful to look at the correlation between the outcome and predictor variables. Confirming the results of the relative importance analysis, we see that Branch service is the most strongly correlated with Overall Satisfaction. Additionally, the correlation between the predictor variables is low. This explains why the results of the relative importance analysis and linear regression do not differ dramatically. If you find strongly correlated variables in your data set, you may want to remove some of them.
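
In R, this check is a one-liner; again, the data frame and column names below are placeholders for your own outcome and attribute variables.

# Correlation matrix of overall satisfaction and the drivers (hypothetical names);
# pairwise deletion copes with respondents who skipped individual questions
round(cor(bank[, c("Overall", "Branch.service", "Fees", "Interest.rates", "Phone.service")],
          use = "pairwise.complete.obs"), 2)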


Scatterplot of raw data

Another way to check your data is to create a scatterplot. The example below uses small multiples to show all of the predictor variables. It is not the easiest chart to read because of the overlapping points. But it is useful to check that there are no unusual clusterings for any of the variables.
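
A ggplot2 sketch of this kind of small-multiples check is below. It assumes the same hypothetical bank data frame used above, with one column per attribute plus an Overall column.

# Small multiples of each attribute against overall satisfaction (sketch; jitter is added
# because satisfaction scores are discrete and points overlap heavily)
library(ggplot2)
library(tidyr)
long <- pivot_longer(bank, cols = -Overall, names_to = "attribute", values_to = "rating")
ggplot(long, aes(x = rating, y = Overall)) +
  geom_jitter(alpha = 0.3, width = 0.2, height = 0.2) +
  facet_wrap(~ attribute)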

Importing your data

Displayr can import a variety of file types, including .sav (SPSS), Excel, or .csv files. If you are using SurveyMonkey, it is easiest to export your data as a .sav file, which will include metadata about the variables. Alternatively, save your data as an Excel or .csv file. Below, we show a snippet of the .csv file used for the analysis above.

Data file preview

After importing the data into Displayr, make sure that the data is set up properly. In particular, the overall satisfaction and attribute satisfaction scores should be numeric variables. If you look at the Data Sets tab in the bottom-left of the window, each of these variables should have a numeral as its icon.

If the icon next to the variables looks different, you can change the structure of the variable by clicking on it in the Data Sets tab. You will then see properties of the variable show up on the Object Inspector on the right of the screen. You can then change the structure to Numeric or Numeric - Multi.

Publishing your dashboard

Once your data is all hooked up to the analyses and visualizations in the template, it's time to publish your dashboard! To publish your dashboard as a web page, go to Export in the ribbon and click Web Page. This will create your published dashboard. You can share this link with anyone you like, and they can navigate through the dashboard. The instructions in the template will be hidden in the published version. If you need to go back and change anything, you can just click Embed > Update All and your published dashboard will update.

What’s in the Future for Predictive Lead Scoring?

If you haven't read our "What is Predictive Lead Scoring?" article yet, make sure you check it out!

What predictive lead scoring has done is improve on traditional methods of lead scoring. By reducing the role human judgment plays in prioritizing leads, predictive lead scoring was always designed to improve the accuracy and efficiency of these traditional methods. Furthermore, it's supplied a data-driven reason for decisions made by marketing and sales teams, helping to align these two teams and improving the efficiency of lead prioritization systems. While not radically revolutionizing the face of lead scoring, predictive lead scoring has done what it promised.

Predictive Lead Scoring is Improving

Make no mistake though, predictive lead scoring is still the way of the future. Advances in technology will result in more streamlined systems being used. The machine learning algorithms that determine scores for leads will continue to evolve, providing increasing value -- for example, identifying with greater accuracy the qualified leads most likely to convert. The next generation of lead scoring will be fuelled by APIs, more data, significant training, and constantly fine-tuned algorithms. Predictive lead scoring uses different models, such as logistic regression and random forests, to determine which leads are the best match, and richer data will allow these models to produce more accurate results.

In addition, predictive lead scoring models are always improving, and the potential upside is large. The barriers that once restricted who could conduct predictive lead scoring are slowly disappearing. Early predictive lead scoring models required big datasets and thousands of contacts with clean and accurate associated data in order to make predictions. Many businesses lack either the quantity or the specific quality of data necessary to train an accurate model. Newer models can pull more data from third-party sources and integrate it with the data from a business's CRM platform to supplement what is already there.

As the machine learning field evolves, naturally so too will the technologies for predictive lead scoring. For example, developments in neural networks already make it possible to produce more accurate scores, because neural networks can intelligently aggregate data from various sources at the same time. With the scale of your predictive lead scoring able to grow alongside the growth of your business, it makes sense to start investing in it for the future.

What is Predictive Lead Scoring?

To understand predictive lead scoring, we first have to understand traditional lead scoring, so let’s rewind a little bit.

What is Traditional Lead Scoring?

Lead scoring is the process of assigning scores to prospects and potential customers based on relevant data. Relevant data could include demographic information, such as area of work or role title. Other data can include frequency of online engagement or viewing particular website pages that signal an interest in purchasing. In traditional lead scoring, marketers rank the significance of certain actions to gauge customer intent and qualify leads. For example, a visitor who finds the homepage through an organic search and fills out a form or subscribes for more information will most likely receive a higher score than someone who opened an email or read a single blog post before bouncing.

Marketers and salespeople rank potential customers against a scale that represents how likely that lead is to convert and its perceived value. The resulting score determines which leads marketers will assign the highest priority to be contacted by a sales representative. In an ideal world, leads are scored accurately and marketers can pass the most valuable new leads on to the sales team along with some handy information about them.

Unfortunately, we don’t live in an ideal world and leads are not always scored accurately. Marketers often depend on their own subjective judgment or past data patterns to evaluate and weigh actions they deem relevant to making a sale. Opportunities may slip through the cracks, or sales teams may spend too much time chasing under-qualified or ill-suited leads based on inaccurate scores.

What about Predictive Lead Scoring?

Here’s where predictive lead scoring can come to the rescue. Predictive lead scoring takes out or reduces the element of human error and increases the accuracy of identifying quality leads. Predictive lead scoring uses predictive modeling, a common statistical technique used to predict future behavior based on past behavior. Advanced predictive modeling algorithms combine historical and current data to generate a model predicting future outcomes. Linked CRM and marketing automation solutions provide internal data for these algorithms.

Predictive modeling algorithms pull in all this data and analyze successful and unsuccessful leads in order to find patterns. It is these patterns that identify the factors that are most relevant and useful in predicting sales. From this combination of historical demographic and activity data, predictive lead scoring may be able to come up with an ideal profile of the customer most likely to buy, and therefore identify the warmest leads. It can also help to identify patterns or relationships in the data that were previously missed. Beyond just reducing the margin of human error in lead scoring, predictive lead scoring helps the marketing and sales teams align around data-driven lead scoring qualifications.
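
As a concrete, and heavily simplified, illustration, a logistic regression lead-scoring model in R might look like the sketch below. The data frames and column names are hypothetical stand-ins for whatever your CRM exports.

# Minimal predictive lead scoring sketch: model past conversions, then score current leads
model <- glm(converted ~ industry + role + pages_viewed + emails_opened + filled_form,
             data = past_leads, family = binomial)
current_leads$score <- predict(model, newdata = current_leads, type = "response")  # probability of converting
head(current_leads[order(-current_leads$score), ])  # the warmest leads rise to the top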

Check out "What's in the Future for Predictive Lead Scoring?" next.

Decision Trees Are Usually Better Than Logistic Regression

If you've studied a bit of statistics or machine learning, there is a good chance you have come across logistic regression (aka binary logit). It is the old-school standard approach to building a model where the goal is to predict an outcome with two categories (e.g., Buy vs Not Buy). If you are a good statistician with a lot of time on your hands it is a great technique. But for everybody else, it has been superseded by various machine learning techniques, with great names like random forest, gradient boosting, and deep learning, to name a few. In this post I focus on the simplest of the machine learning algorithms - decision trees - and explain why they are generally superior to logistic regression. I will illustrate using CART, the simplest of the decision trees, but the basic argument applies to all of the widely used decision tree algorithms.

Create your own CART decision tree

Logistic regression's big problem: difficulty of interpretation

The main challenge of logistic regression is that it is difficult to correctly interpret the results. In this post I describe why decision trees are often superior to logistic regression, but I should stress that I am not saying they are necessarily statistically superior. All I am saying is that they are better because they are easier and safer to use. Even the most experienced statistician cannot look at the table of outputs shown below and quickly make precise predictions about what causes churn. By contrast, a decision tree is much easier to interpret.


Decision trees: the easier-to-interpret alternative

The decision tree below is based on an IBM data set which contains data on whether or not telco customers churned (canceled their subscriptions), and a host of other data about those customers. The decision tree shows how the other data predicts whether or not customers churned. This is an interactive visualization that allows you to hover, zoom, and collapse things by clicking on them (best viewed on a desktop).



The way to read it is as follows:

  • The single best predictor of churn is contract length. We know this because it appears on the far left.
  • People with a month-to-month contract are different from those with a one or two year contract. The type of decision tree I have used (CART) always splits into two categories. Because one and two years have been combined we know that the difference between these two groups is less than their difference to month-to-month. It does not necessarily mean that there is no difference between one and two year contract people in terms of their propensity to churn. The decision tree could, if the data warranted, split people further on in terms of one and two year contracts.
  • People with a one or two year contract are less likely to churn than those with a month-to-month contract. We can see this by the color shading, where bluer means more likely to churn and redder means less likely to churn. If you hover your mouse over the nodes, which are the grey vertical rectangles, you can see the underlying data, as shown to the right, which tells us that people on a one or two year contract have only a 7% chance of churning.
  • There are more people on a month-to-month contract than are on a one or two year contract. We know this because the corresponding "branch" of the tree is thicker. We can also see the number of people by hovering over the node.
  • If we know somebody is on a one or two year contract, that is all we need to know. The predictions of the model do not require splitting this branch further.
  • Among the people on a month-to-month contract, the best predictor is their internet service, with people on a fiber optic service being much more likely to churn (again, we can see this both by the blueness of the branch, and if we hover over the node).
  • Among people with a month-to-month contract who have a fiber optic connection, if their tenure is 15 months or less, they are likely to churn (69%), whereas those on the fiber optic plan with a longer tenure are less likely to churn.

In this manner we can continue explaining each branch of the tree.
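
The tree above was created in Displayr, but if you want to sketch the same kind of CART model in R, the rpart package is the standard tool. The column names below follow IBM's telco churn data set and may need adjusting for your own file; the plot will be a conventional tree rather than the sankey diagram shown above.

# CART decision tree for churn (sketch; 'telco' is an assumed data frame name)
library(rpart)
library(rpart.plot)
tree <- rpart(Churn ~ Contract + InternetService + tenure + MonthlyCharges,
              data = telco, method = "class")
rpart.plot(tree)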

Decision trees are safer

The problem of logistic regression being hard to interpret is much more serious than it first appears. As most people are not able to interpret it correctly, they end up not even noticing when they have stuffed it up, leading to a double boo-boo, whereby they inadvertently create a model that is rubbish, which they then go on to misinterpret. Am I talking about you? Are you using feature engineering to ensure that the assumption of linearity isn't a problem? Did you use an appropriate form of imputation to address missing data? Are you controlling your family-wise error rate or using regularization to address forking paths? How are you detecting outliers? Are you looking at your G-VIFs to investigate multicollinearity? If you are reading this and thinking "what?", then the options are to go back to graduate school and invest in some stats learning, or say goodbye to logistic regression and replace it with decision trees.

The great thing about decision trees is that they are as simple as they appear. No advanced statistical knowledge is required in order to use them or interpret them correctly. Yes, sure, there are ways you can improve them if you are an expert, but all that is really required to be successful when you use them is common sense.

Decision trees predict well

With the data set used in this example I performed a test of predictive accuracy of a standard logistic regression (without taking the time to optimize it by feature engineering) versus the decision tree. When I performed the test I used a sample of 4,930 observations to create the two models, saving a further 2,113 observations to check the accuracy of the models. The models predicted essentially identically (the logistic regression was 80.65% and the decision tree was 80.63%). My experience is that this is the norm. Yes, some data sets do better with one and some with the other, so you always have the option of comparing the two models. However, given that the decision tree is safe and easy to understand, this means that, to my mind, it is always the safer alternative.
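
If you want to replicate that style of comparison yourself, the sketch below assumes you have already split the data into train and test data frames with a Yes/No Churn column; it is an illustration, not the exact code used for the numbers above.

# Compare out-of-sample accuracy of logistic regression and a CART tree (sketch)
logit <- glm(Churn ~ ., data = train, family = binomial)
tree <- rpart::rpart(Churn ~ ., data = train, method = "class")
logit.acc <- mean((predict(logit, newdata = test, type = "response") > 0.5) == (test$Churn == "Yes"))
tree.acc <- mean(predict(tree, newdata = test, type = "class") == test$Churn)
c(logistic = logit.acc, tree = tree.acc)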

However, if your focus is solely on predictive accuracy, you are better off using a more sophisticated machine learning technique, such as random forests or deep learning.

So why, then, are logistic regressions better known than decision trees?

In addition to the benefit of being a lot older, logistic regression is, if you have a lot of time and expertise, pretty cool and does some things a lot better than a decision tree. Consider, for example, the role of tenure shown below. The decision tree tells us that if somebody is on a month-to-month contract, with DSL or no internet service, the next best predictor is tenure, with people with a tenure of 6 months or more having an 18% chance of churning, compared to a 42% chance for people with a tenure of less than 6 months. As far as predictions go, this is a bit blunt. It seems unlikely that 6 months is the magical cutoff. A more likely explanation is that the real situation is that the likelihood of churn drops a little for every additional month of tenure. Decision trees simplify such relationships. A logistic regression can, with appropriate feature engineering, better account for such a relationship.

A second limitation of a decision tree is that it is very expensive in terms of sample size. Each time it splits the data using a predictor, the remaining sample size reduces, and eventually gets to a stage where there is not enough data to identify further predictors. However, it is likely that some of these further predictors are still relevant. By contrast, logistic regression looks at the simultaneous effects of all the predictors, so can perform much better with a small sample size. The flip side of this is that often effects are sequential rather than simultaneous, in which case decision trees are much better. The decision tree shown in this post is a good example of a case where such a sequential relationship likely does make more sense; if somebody is on a contract they are locked in and other predictors are likely not relevant (and would incorrectly be assumed to be relevant if applying typical logistic regression).

Another weakness of decision trees is that they have their own potential for misinterpretation, with many people incorrectly assuming that the order with which predictors appear in a tree tells you something about their importance. Unfortunately, this is often not the case. For example, if you have two highly correlated predictors, only one of them may appear in the tree and which one it is will be a bit of a fluke.

The consequence of all of these strengths of logistic regression is that if you are doing an academic study and wanting to make conclusions about what causes what, logistic regression is often much better than a decision tree. However, if instead the goal is to either make a prediction, or describe the data, then logistic regression is often a poor choice.

Create your own CART decision tree

Technical details

There are lots of different algorithms for creating decision trees. In this post I have used a classification tree, created in Displayr using Insert > Machine Learning > Classification And Regression Trees (CART). There are also a number of different ways of showing decision trees; in this post I am showing the decision tree as a sankey diagram, which is I think the best way (but is not the most common way) of showing decision trees.

When creating a decision tree, you will need to determine how big the tree should be. If the goal of the decision tree is predictive accuracy, it is usually advisable to create the tree that maximizes predictive accuracy based on cross-validation. In Displayr, this is achieved by setting Pruning to Minimum error (which is the default).

If the goal when creating a decision tree is to describe the data, focusing more on what has happened in the past than on predicting what will happen in the future, it can be useful to either:

  • Create a smaller tree, if the one that maximizes predictive accuracy is too big. One way to do this is set Pruning to Smallest tree, which finds a relatively small tree with relatively good predictive accuracy
  • Create a bigger tree, if the one that maximizes predictive accuracy is too small. This can be done by setting Pruning to None. Keep in mind if selecting this option that there is a good chance that some of the relationships that appear in the smaller branches will be flukes.

Have we convinced you? Create your own decision tree here!

What is Feature Engineering?

The goal of feature engineering

The data used to create a predictive model consists of an outcome variable, which contains the data that needs to be predicted, and a series of predictor variables that contain data believed to be predictive of the outcome variable. For example, in a model predicting property prices, the data showing the actual prices is the outcome variable. The data showing things such as the size of the house, the number of bedrooms, and location are the predictor variables. These are believed to determine the value of the property.

A "feature" in the context of predictive modeling is just another name for a predictor variable. Feature engineering is the general term for creating and manipulating predictors so that a good predictive model can be created.

Feature creation

The first step in feature engineering is to identify all the relevant predictor variables to be included in the model. Identifying these variables is a theoretical rather than practical exercise and can be achieved by consulting the relevant literature, talking to experts about the area, and brainstorming.

A common mistake people make when they start predictive modeling is to focus on data already available. Instead, they should be considering what data is required. This mistake often leads to two practical problems:

  • Essential predictor variables end up being left out of the model. For example, in a model predicting property prices, knowledge of the type of property (e.g., house, apartment, condo, retail, office, industrial) is crucially important. If this data is not available, it needs to be sourced well before any attempt is made at building a predictive model.
  • Variables that should be created from available data are not created. For example, a good predictor of many health outcomes is the Body Mass Index (BMI). To calculate BMI, you have to divide a person's weight by the square of their height. To build a good predictive model of health outcomes you need to know enough to work out that you need to create this variable as a feature for your model.  If you just include height and weight in the model, the resulting model will likely perform worse than a model that includes BMI, height, and weight as predictors, along with other relevant variables (e.g., diet, a ratio of waist to hip circumference).
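
A minimal sketch of this kind of feature creation in R, with hypothetical data frame and column names:

# Create BMI from existing weight (kg) and height (m) columns ('patients' is a hypothetical data frame)
patients$BMI <- patients$weight_kg / patients$height_m^2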

Transformations

Feature transformation involves manipulating a predictor variable in some way so as to improve its performance in the predictive model. A variety of considerations come into play when transforming features, including:

  • The flexibility of machine learning and statistical models in dealing with different types of data. For example, some techniques require that the input data be in numeric format, whereas others can deal with other formats, such as categorical, text, or dates.
  • Ease of interpretation. A predictive model where all the predictors are on the same scale (e.g., have a mean of 0 and a standard deviation of 1), can make interpretation easier.
  • Predictive accuracy. Some transformations of variables can improve the accuracy of prediction (e.g., rather than including a numeric variable as a predictor, instead include both it and a second variable that is its square).
  • Theory. For example, economic theory dictates that in many situations the natural logarithm of data representing prices and quantities should be used.
  • Computational error. Many algorithms are written in such a way that "large" numbers cause them to give the wrong result, where "large" may not be so large (e.g., more than 10 or less than -10).

Feature Engineering for Numeric Variables and Feature Engineering for Categorical Variables describe data transformation in more detail.

Feature extraction

Transformations involve creating a new variable by manipulating one variable in some way or another. Feature extraction involves creating variables by extracting them from some other data. For example, using:

  • Principal components analysis (PCA) to create a small number of predictor variables from a much larger number.
  • Orthogonal rotations of predictor variables to minimize the effect of them being highly correlated.
  • Cluster analysis to create a categorical variable from multiple numeric variables.
  • Text analytics to extract numeric variables, such as sentiment scores, from text data.
  • Edge detection algorithms to identify shapes in images.
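
As an illustration of the first of these, a PCA-based feature extraction in base R might look like the following sketch, where predictors is a hypothetical data frame of numeric variables.

# Extract a small number of components from many numeric predictors (sketch)
pca <- prcomp(predictors, scale. = TRUE)  # standardize, then compute principal components
summary(pca)                              # proportion of variance explained by each component
new.features <- pca$x[, 1:3]              # keep the first three component scores as predictors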

Feature selection

Feature selection refers to the decision about which predictor variables should be included in a model. To a novice, it might seem obvious to include all the available features in the model and let the predictive model automatically select which ones are appropriate. Sadly, it is not so simple in reality. Sometimes the computer you are using will crash if you select all the possible predictor variables. Sometimes the algorithm being used may not have been designed to take all available variables. And if you were to include all the possible features in a model, the model may end up identifying spurious relationships. Just like people, if you give a model a whole lot of data, it can often come up with predictions that seem to be accurate, but which are just coincidences.

Feature selection in practice involves a combination of common sense, theory, and testing the effectiveness of different combinations of features in a predictive model.

Want to know more? Find out more with our What Is guides!

Feature Engineering in Displayr

Feature engineering refers to the process of manipulating predictor variables (features) with the goal of improving a predictive model. In this post I outline some of the key tools and processes for feature engineering in Displayr.

Switching between categorical and numeric treatment of predictor variables

Perhaps the most fundamental form of feature engineering when building a predictive model is the decision about whether to treat a particular predictor as being categorical or numeric. In Displayr, the way that a variable is treated in a model is determined by its structure. Displayr has 15 different structures, but the two key ones of relevance in most predictive models are Numeric and Mutually exclusive categories (nominal), where Mutually exclusive categories (nominal) means that the data is treated as being categorical.

The structure of a variable is changed by selecting the variable in the Data Sets Tree (bottom-left), and changing Object Inspector > Properties > INPUTS > Structure. Sometimes a variable will be grouped into a variable set with other variables. It can be split by selecting Data Manipulation > Split.

Creating a new numeric variable

There are many tools in Displayr for creating new variables. The most flexible tool is to select Insert > New R (Variables) > Numeric Variable, which allows you to create a new variable using the R language. For example, to create a new variable which is the natural logarithm of an existing variable, called Tenure, type log(Tenure). See Feature Engineering for Numeric Variables for examples of the code to do things like winsorize, cap, normalize, and calculate polynomials.

Creating a new categorical variable

Categorical variables are created as follows:

  • Start by creating a numeric variable: Insert > New R (Variables) > Numeric Variable and enter code in the R CODE box.
  • Change the type to categorical with Object Inspector > Properties > INPUTS > Structure: Mutually exclusive categories (nominal).
  • Labels and values can be modified by clicking on the various options in Object Inspector > Properties > DATA VALUES.

Missing value settings

To modify which values of a variable are treated as missing, select the variable and then press Object Inspector > Properties > DATA VALUES > Missing values.

Merging categories of categorical variables

Categories of categorical variables can be merged by dragging and dropping. This is done by:

  • Dragging the variable from the Data Sets Tree onto the page. This will create a table.
  • Click on the table and then click on one of the categories you wish to merge. When three grey lines appear to the right, you can click on them and drag the category onto another category to merge them. Alternatively, you can use control or shift to select multiple categories and merge them using Data Manipulation > Merge (Rows/Columns).

Reordering categories of categorical variables

Categories can be reordered by clicking on them (see the previous section), and dragging them.

Feature extraction

Displayr contains a large number of tools for feature extraction. For example:

  • Principal components analysis (PCA), for extracting dimensions from numeric variables: Insert > Dimension Reduction > Principal Components Analysis. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s), which will add the variables to the data set.
  • t-SNE, which is a highly nonlinear dimension reduction technique: Insert > Dimension Reduction > t-SNE. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s) > Components/Dimensions.
  • Multiple correspondence analysis, for extracting dimensions from categorical variables: Insert > Dimension Reduction > Multiple Correspondence Analysis. Once the analysis has been run, select the output and then click Insert > Dimension Reduction > Save Variable(s).
  • The various cluster analysis and latent class analysis tools in Insert > Group/Segment.

You can do anything...

Displayr supports all the main R packages, so it can perform any feature engineering that you require. If you cannot figure out how to do something, please contact us.

Ready to try this yourself? You can do this, and so much more, for free! Get started now!

Feature Engineering for Categorical Variables

When creating a predictive model, there are two types of predictors (features): numeric variables, such as height and weight, and categorical variables, such as occupation and country. In this post I go through the main ways of transforming categorical variables when creating a predictive model (i.e., feature engineering categorical variables). For more information, also check out Feature Engineering for Numeric Variables.

In this post I work my way through a simple example, where the outcome variable is the amount of gross profit that a telco makes from each of a sample of 917 of its customers, and there are two predictors: industry and the turnover of the company. The goal of the predictive model is to identify which industries and company sizes to focus its efforts on. Tables showing the proportion of customers in each of the categories for the two features (aka predictor variables) are shown below.

Using the predictive model's in-built tools for categorical variables

The simplest approach to analyzing such data is to just select the predictor variables in whatever software you are using and let the software automatically decide how to treat the data. If you are using well-written software this will often be an OK option, and the software will likely use one-hot encoding in the background.

The table below shows the r-squared statistics for both a linear regression and a random forest. The random forest does a much better job at predicting profit with the data used to fit the models (this is shown in the In-sample R2 column). This is not surprising, as this model is much more flexible, so can typically fit a specific data set better than a standard linear regression. However, in data not used to fit the model, this result reverses, with the regression performing better. In both cases the out-of-sample accuracy is much, much worse. (See How to do Logistic Regression in Displayr for a discussion regarding the need for saving data for use in checking a model). This is a general issue with categorical predictors: the more categories, the more likely the models will over-fit (produce much better in-sample than out-of-sample fits).

Treating the predictors as numeric variables

An alternative to treating the data as categorical is to treat it as numeric. This entails assigning a value of 1 to the first category, 2 to the second category, and so on (e.g., 1 to Accommodation, Cafes & Restaurants, a 2 to Agriculture, Forestry & Fishing, etc.). The bottom two lines of the table below show the results of the models when the variables are treated as numeric. After this feature engineering, the linear regression essentially fails, with an r-squared of 0 in the data used to estimate the model and an error when computing the out-of-sample fit (the model produced worse-than-random predictions). The random forest does better, but the out-of-sample fit is substantially worse. We can learn from this that it is important to avoid unintentionally treating categorical predictors as numeric variables when creating predictive models (this may sound obvious, but it can be easy to do this inadvertently if you do not know to avoid the problem).

Judgmental encodings

There is an obvious ordering to the turnover categories below. An alternative way of analyzing the turnover data is to treat it as numeric, but use judgment to assign values to the data. For example, we could use consecutive integers, assigning a 1 to the lowest category, a 2 to the next lowest, and so on. Alternatively, we could assign midpoints or some other value. In the two tables below I have assigned values to each of the two predictor variables. In the case of the turnover data, the judgments seem reasonably sound to me (other than for Missing data), but for the industry data the encoding has little more than guesswork to justify it.
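
One simple way to implement a judgmental encoding in R is a named lookup vector, as in the sketch below. The category labels and assigned values are illustrative only, and the rules listed below still apply.

# Judgmental encoding of turnover via a named lookup vector (illustrative values, in $ millions;
# 'customers' and the category labels are hypothetical)
turnover.values <- c("Less than $100,000" = 0.05,
                     "$100,000 to $1 million" = 0.5,
                     "$1 million to $10 million" = 5,
                     "Over $10 million" = 15)
customers$TurnoverEncoded <- unname(turnover.values[as.character(customers$Turnover)])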

When coming up with your own encodings, there are a few rules:

  • The results of predictive models are not super-sensitive to the choices. For example, we could assign a value of 0, .05, or 1 to the Less than $100,000 category and it will make no difference, and even with the Over $10 million category, any of 10, 15, or 20 will likely not make a huge impact.
  • You should not look at relationship between the predictor variable and the outcome variable when forming the encoding. If you do, you will overfit your model (create a model that looks good initially but predicts poorly).

The results of the models with the judgmental encoding are shown below. Each of these models is relatively poor in-sample. This is to be expected, as the models with numeric predictor variables are less flexible. However, the out-of-sample performance of these two models is the best we have seen. It is pretty common to find this type of result: incorporating judgment into modeling tends to lead to better models.

Mixed judgment and categorical

The judgments regarding the encoding of turnover are relatively easy to defend. The ones made about industry are not. This suggests we should consider treating industry as a categorical variable, but use turnover with its judgmental encoding. The resulting two models are shown at the bottom of the table below. Looking at the in-sample r-squareds, we can see that the random forest has improved markedly on the training data, as now it can explore all possible combinations of industry and how they interact with turnover. However, its out-of-sample performance has become abysmal. By contrast, the best of the models is now our regression model, with the categorical industry and judgmentally-encoded turnover. This highlights an important practical issue with feature engineering: its effect varies depending on which predictive algorithm we use.


Merging categories

The massive degree of overfitting we are seeing with the random forest is because we have so many categories of industry, which leads to too many possible ways of combining industry together. We can limit the number of possible combinations by using judgment to merge some of the categories. I've used my own judgment to merge the categories as shown in the table below. For the same reasons described in the earlier section on judgmental encodings, it is important not to look at the outcome variable when working out which categories should be merged. By contrast, it is appropriate to look at the sizes of the categories, as categories with small samples tend to be unreliable and often benefit from being merged.

The in-sample r-squared of both the regression and the random forest declines. This is to be expected. By merging the categories we have guaranteed that the in-sample fit will decline. In the case of the random forest, the out-of-sample r-squared is better than the previous model, suggesting that the merging of the categories was not completely hopeless. However, the best model remains the regression model with all the industry categories and the judgmentally-encoded turnover variable.

Interactions

An interaction is a new variable, or set of variables, created by multiplying together predictor variables. To explain how they work with categorical variables it is necessary to delve a little into the detail of how predictive models deal with categorical variables.

The industry variable has 16 categories and the turnover variable has nine. When most (but not all) machine learning and statistical methods analyze a categorical variable they perform one-hot coding in the background. In the case of our 16 categories for industry, what this means is that 15 numeric variables are created and included in the model, one for all but the first of the categories. These are given a value of 1 when the data contains that category and a value of 0 otherwise. (Such variables are also known as dummy variables.)  So, when we are estimating our model with industry and turnover, we are estimating 15 and 8 variables in the background to represent these two categorical variables.
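
You can see this one-hot (dummy) coding for yourself in R with model.matrix; the data frame and variable name below are hypothetical.

# One-hot (dummy) coding of a categorical predictor; treatment coding uses the first
# category as the reference, and [, -1] removes the intercept column
dummies <- model.matrix(~ Industry, data = customers)[, -1]
head(dummies)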

A standard interaction of industry by turnover would create 16 * 8 = 128 variables in the background (it's actually a bit more complicated than this, but hopefully you get the idea). This considerably improves the in-sample fit, but tends to raise lots of problems in the out-of-sample predictive accuracy, as often combinations of categories that exist in the data used to estimate the model do not exist in the data set used to validate the model, and vice versa. The solution, then, when you have categorical variables with large numbers of categories and wish to create interactions, is to use merged variables when creating interactions and/or numerically encoded variables.

While you can create the interactions by hand, most predictive modeling software has automatic tools to make this easy. In Displayr, we do this by editing the R code (Object Inspector > Properties > R CODE), and adding a * or a : to the formula, as can be seen in the screenshot below (see Feature Engineering for Numeric Variables for the distinction between : and *).

The resulting model, shown at the bottom of the table has the best out-of-sample r-squared of all of the models considered so far. The out-of-sample r-squared is as good as the in-sample r-squared, but this is just good luck and is not the norm.

 

This post has provided a tour of some of the options for engineering categorical variables (features) when creating predictive models. The interesting thing to note is the substantial impact that can be achieved by judiciously transforming categorical variables, with the resulting model substantially better than the default models that can be obtained without transforming the data.

Explore the original dashboard

If you want to look at the various analyses discussed in this post in more detail (e.g., looking at the coefficients of all the models), click here to view the Displayr document that contains all the work. If you want to reproduce these analyses yourself, either with this data or some other data, please check out:

Click the button above to edit and explore the original analyses!

Feature Engineering for Numeric Variables

When building a predictive model, it is often practical to improve predictive performance by modifying the numeric variables in some way. In statistics, this is usually referred to as variable transformation. In this post I discuss some of the more common transformations of a single numeric variable: ranks, normalizing/standardizing, logs, trimming, capping, winsorizing, polynomials, splines, categorization (aka bucketing, binning), interactions, and nonlinear models.

The goal of feature engineering for a numeric variable is to find a better way of representing the numeric variable in the model, where "better" connotes greater validity, better predictive power, and improved interpretation. In this post I am going to use two numeric variables, Tenure and Monthly Cost, from a logistic regression predicting churn for a telecommunications company (see How to Interpret Logistic Regression Outputs for more detail about this example). The basic ideas in this post are applicable to all predictive models, although some of these transformations have little effect on decision tree models (such as CART or CHAID), as these models only use the orders, rather than the values, of the numeric predictor variables.

Ranks

The simplest way of transforming a numeric variable is to replace its values with their ranks (e.g., replacing 1.32, 1.34, 1.22 with 2, 3, 1). The rationale for doing this is to limit the effect of outliers in the analysis. If using R, Q, or Displayr, the code for the transformation is rank(x), where x is the name of the original variable. The output below shows a revised model where Tenure has been replaced by Rank Tenure. If we look at the AIC for the new model it is 3,027.4, which is lower (which means better) than for the original model, telling us that the rank variable is a better variable. However, we have a practical problem which is that the estimated coefficient is 0.00. This is a rounding problem, so one solution is to look at more decimal places. However, a better solution is to transform the predictor so that it does not provide such a small estimate (this is desirable because computers can make rounding errors when working with numbers very close to 0, as can humans when looking at such numbers).

Standardizing/Normalizing

Standardizing - which is usually (but not always) the same thing as normalizing - means transforming a variable so that it has a mean of 0 and standard deviation of 1. This is done by subtracting the mean from each value of a variable and then dividing by its standard deviation. For example, 0, 2, 4 is replaced by -1, 0, and 1. In R, we can use scale(x) as a shortcut. The output below replaces Rank Tenure with its standardized form. There are three important things to note about the effect of standardizing. First, the estimate for (Intercept) changes. This is not important. Second, the estimate for the variable changes. In our case, it is now clearly distinct from 0. Third, the other predictors are not changed, unless they too are modified.

If all the variables are standardized it makes it easier to compare their relative effects, but harder to interpret the true meaning of the coefficients, as it requires you to always remember the details of the transformation (what the standard deviation was prior to the transformation).

Logs

In economics, physics, and biology, it is common to transform variables by taking their natural logarithm (in R: log(x)). For example, the values of 1, 3, and 4, are replaced by 0, 1.098612289, and 1.386294361.

The rationale for using the logarithm is that we expect a specific type of non-linear relationship. For example, economic theory tells us that we should expect that, all else being equal, the higher the monthly charge, the more likely somebody will churn, but that this will have a diminishing effect (i.e., the difference between $100 and $101 should be smaller than the difference between $1 and $2). Using the natural logarithm is consistent with such an assumption. Similarly, we would expect that the difference between a tenure of 1 versus 2 months is likely to be much bigger than the difference between 71 and 72 months.

The output below takes the logarithm of tenure. When compared to the previous models based on the AIC it is the best of the models. However, a closer examination reveals that something is amiss. The previous model has a sample size of 3,522, whereas the new model has a slightly smaller sample size. As sample size determines AIC, we have a problem: the AIC may be lower because the model is better or because of our missing data.

The problem with taking logarithmic transformations is that they do not work with values of less than or equal to 0, and in our example five people have a tenure of 0. The fix for this is simple: we add 1 to all the numbers prior to taking the natural logarithm. Below the output shows the results for this modified model. This latest model has our best AIC yet at 3,002.40, which is consistent with a very general conclusion about feature engineering: using common sense and theory is often the best way to determine the appropriate transformations.
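
In R, the modified variable is a one-liner; the lowercase variable name follows the code snippets later in this post.

# Natural logarithm of tenure after adding 1, so that tenures of 0 are retained
LogTenurePlus1 = log(tenure + 1)  # log1p(tenure) is an equivalent alternative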

Trimming

Trimming is where you replace the highest and lowest values of a predictor with missing values (e.g., the top 5% and the bottom 5%). At first blush this feels like a smart idea, as it removes the outliers from the analysis. However, after spending more than 20 years toying with this approach, my general experience is that it is never useful. This is because when you replace the actual values with missing values, you end up needing to find a way of adequately dealing with the missing values in the model. This is a substantially harder problem than finding a good transformation, as all of the standard approaches to dealing with missing values are inapplicable when data is trimmed (to use the jargon, data that is trimmed is nonignorable).

Winsorizing

Winsorizing, also known as clipping, involves replacing values below some threshold (e.g., the 5th percentile) with that percentile, and replacing values above some other threshold (e.g., the 95th percentile) with that value. With the tenure data, the 5th percentile is 1, and the 95th percentile is 72, so winsorizing involves recoding the values less than 1 as 1 and more than 72 as 72. In this example, 72 is also the maximum, so the only effect of winsorizing is to change the lowest values of 0 to 1. With the example being used in this post the winsorization had little effect, so the output is not shown. While in theory you can try different percentiles (e.g., 10th and 90th), this is a bit dangerous as there is no theory to guide such a decision, although using a histogram or density plot to identify extreme values can be useful. An alternative and often better approach is to use polynomials or splines (discussed later in this post). The following R code below winsorizes tenure.

 
# Winsorize tenure at the 5th and 95th percentiles
x = tenure
quantiles = quantile(x, probs = c(0.05, 0.95))
x[x <= quantiles[1]] = quantiles[1]  # values below the 5th percentile are set to the 5th percentile
x[x >= quantiles[2]] = quantiles[2]  # values above the 95th percentile are set to the 95th percentile
x

Capping

Capping is the same basic idea as winsorizing, except that you only apply the recoding to the higher values. This can be particularly useful with data where the very highest values are likely to be extreme (e.g., as with income and house price data). The following code caps the tenure data at 30:

 
# Cap tenure at 30: any value above 30 is recoded to 30
x = tenure
x[x > 30] = 30
x

The output from the model with tenure capped at 30 is shown above. The model is better than our initial model, but not as good as any of the more recent models. The reason why it performs better than the original model can be understood by looking at its coefficient of -0.06, which is twice the coefficient of the first model (-0.03), which tells us that the effect of tenure is comparatively greater for the lower values of tenure (as hypothesized in the discussion of logarithms).

Polynomials

When we take the logarithm we are asserting a specific non-linear relationship. In economics, where sample sizes are often very small, this is often a good thing to do. However, in our data set we have a much larger sample, so it makes sense to use a more general non-linear specification and try and extract the nature of the nonlinearity from the data. The simplest way to do this is to fit a quadratic model, which is done by both including the original numeric variable and a new variable that contains its square (in R: x^2). The resulting model for tenure is shown below. This one is actually worse than our previous model. It is possible to also use cubics and higher order polynomials, but it is usually better practice to fit splines, discussed in the next section.

If you do wish to use polynomials, rather than manually computing them, it is usually better to use R's in-built poly function. For example, in R, poly(x, 5) will create the first five polynomials. The cool thing about how this works is that it creates these polynomials so that they are orthogonal, which avoids many of the fitting problems that can occur with higher order polynomials calculated in the traditional way (e.g., x^5) due to multicollinearity. If adding polynomials to a data set in Displayr, you will need to add them one by one (e.g., the fourth variable would be poly(x, 5)[, 4]). Use orthogonal polynomials with care when making predictions, as the poly function will give a different encoding for different samples.
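
For example, a logistic regression with an orthogonal quadratic term for tenure could be specified along the lines of the sketch below; as elsewhere in this post, the variables are assumed to be available in the environment.

# Quadratic (orthogonal polynomial) effect of tenure in a logistic regression (sketch)
churn.poly = glm(Churn_cat ~ poly(Tenure, 2) + MonthlyCharges,
                 subset = training == 1,
                 family = binomial(logit))
summary(churn.poly)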

Splines

Where there is a numeric predictor and we wish to understand its nonlinear relationship to the outcome variable, best practice is usually to use a regression spline, which simultaneously fits the model and estimates the nature of the nonlinear relationship. This is a bit more complicated than any of the models used so far, and is usually done by writing code. Below I show the code and the main numerical output from fitting a generalized additive logistic regression:

 
library(mgcv)
# Generalized additive logistic regression, with a smooth (spline) term for tenure
churn.gam = gam(Churn_cat ~ SeniorCitizen + InternetService_cat + Contract_cat + MonthlyCharges + s(Tenure), 
                subset = training == 1,
                family = binomial(logit))

The key output for our analysis is a plot showing the estimated nonlinear relationship, which is shown below.

 
plot(churn.gam, ylab = "Coefficient of tenure")

The way that we read this is that the tenures are shown on the x-axis, and we can look up the coefficient (effect) for each of these. We can see, for example, that the coefficient is about 1.75 for a tenure of 0 months, but this drops quickly to around 0.4 after 10 months, after which the drop-off rate declines, and declines again at around 24 months. Although the spline is very cool and can detect things that have not been detected by any of the other models, the model's resulting AIC is 3,012, which is not as good as the logarithmic model, suggesting that the various wiggles in the plot reflect over-fitting rather than insight.

Bucketing/binning/categorization

The last approach is to convert the numeric variable into a categorical variable. This can be done judgmentally or via percentiles. In the output below I show the results where I have split the data into deciles (cut(tenure, breaks = 10)) and set the variable as a categorical variable when estimating the model. The first decile is people with tenures from 0 to 7, and is defined as having an estimate of 0 (see How to Interpret Logistic Regression Coefficients for more info about how to interpret coefficients). We can see that the second decile, which is for tenures of 8 to 14, has a much lower coefficient, and then the next one is lower again, but the overall trajectory is very similar to what we saw with the spline.

The bucketing is worse than the spline, and this is pretty much always the case. However, the great advantage of bucketing is that it is really simple to do and understand, making it practical to implement this with any predictive model. By contrast, splines are only practical if using advanced statistical models, and these can be tricky things to get working well if you haven't spent years in grad school.

Interactions

An interaction is a new variable that is created by multiplying together two or more other variables. For example, we can interact tenure and monthly charges by creating a new numeric variable with the code Tenure * `Monthly Charges`. Note that in this example, backticks (which on an international keyboard is the key above the Tab key) surround monthly charges, which is the way to refer to variables in Displayr by their label rather than their name.

If specifying a lot of interactions it can be a bit painful to manually create a variable for each of them. An alternative is to edit the regression formula by going to the R code (Object Inspector > Properties > R CODE), and adding a * to the formula, as can be seen in the screenshot below. Note that when we do this, the regression will automatically estimate three coefficients: one for Monthly Charges, one for LogTenurePlus1, and one for their interaction. If we only wanted to create the interaction we would instead write MonthlyCharges:LogTenurePlus1.

Nonlinear models

Splines are really a nonlinear model rather than a form of feature engineering, and this highlights that sometimes we can avoid the need for feature engineering by using explicit statistical and machine learning models that are designed to detect and adjust for nonlinearity, such as decision trees, splines, random forests, and deep learning. Although such methods can be highly useful, my experience is that even when using such methods it usually pays off to try the various types of transformations described in this post.

Explore the original dashboard

If you want to look at the various analyses discussed in this post in more detail, click here to get a copy of the Displayr document that contains all the work. If you want to reproduce these analyses yourself, either with this data or some other data, please check out:

Ready to get started? Click the button above to view and edit these models!

Building Online Interactive Simulators for Predictive Models in R https://www.displayr.com/building-online-interactive-simulators-for-predictive-models-in-r/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/building-online-interactive-simulators-for-predictive-models-in-r/#respond Tue, 25 Sep 2018 17:00:07 +0000 https://www.displayr.com/?p=9745 ...]]> Correctly interpreting predictive models can be tricky. One solution to this problem is to create interactive simulators, where users can manipulate the predictor variables and see how the predictions change. This post describes a simple approach for creating online interactive simulators. It works for any model where there is a predict method. Better yet, if the model's not top secret, you can build and share the model for no cost, using the free version of Displayr!

In this post I show how to create the very simple simulator shown below. Click the image to interact with it, or click the button below to explore and edit the code.

Explore and edit this simulator

Step 1: Create the model

The first step is to create a model. There are lots of ways to do this, including:

  • Creating the model using R code from within Displayr. I illustrate this below.
  • Pasting in estimates that you have already computed (Insert > Paste Table).
  • Using Displayr's graphical user interface.
  • Creating an R model somewhere else, saving it somewhere on the web (e.g., Dropbox), and then reading it into Displayr using readRDS. (See How to Link Documents in Displayr for a discussion of some utilities we have created for reading from Dropbox.)

In this post I will illustrate by using one of my all-time favorite models - a generalized additive model - via the gam function in the mgcv package. The process for creating this in Displayr is:

  • Log in to Displayr (if you don't already have an account, click the  GET DISPLAYR FREE  button at the top-right of the screen).
  • Press Insert > R Output (Analysis).
  • Enter your code into the R Output and press the  CALCULATE  button at the top of the Object Inspector. In the example below I have fitted a GAM using some of IBM's telco churn example data.
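
As a hedged sketch, the model-fitting code looks something like the following. The data frame name (telco) and the exact formula are assumptions for illustration; the variables match the controls added in Step 2, with SeniorCitizen and InternetService assumed to be factors whose labels match the control items.

library(mgcv)

my.gam <- gam(Churn ~ SeniorCitizen + s(Tenure) + InternetService + MonthlyCharges,
              family = binomial,
              data = telco)

summary(my.gam)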


Step 2: Add controls for each of the predictors

  • Press Insert > Control (More) (this option is on the far right of the ribbon).
  • In the Object Inspector > Properties > GENERAL, set the Name to cSeniorCitizen. You can give it any name you wish, but it is usually helpful to have a clear naming standard. In this example, I am using c so that whenever I refer to the control in code it is obvious to me that it is a control.
  • Click on the Control tab of the Object Inspector and set the Item list to No; Yes, which means that the user will have a choice between No and Yes when using the control.
  • Press Insert > Text box and click and drag to draw a text box to the left of the control. Type Senior Citizen into the text box, set it to be right-aligned (in the Appearance tab of the ribbon), with a font size of 10. You can micro-control layout by selecting the textbox, holding down your control key, and clicking the arrow keys on your keyboard.
  • Click on the control and select No. It should look as shown below.

  • Now, using shift and your mouse, select the text box and the control and press Home > Duplicate, and drag the copies to be neatly arranged underneath. Repeat this until you have four sets of labels and controls, one under each other.
  • Update the textboxes, and each control's Name, and Item list, as follows:
    • Tenure (months), cTenure: 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10; 11; 12; 13; 14; 15; 16; 17; 18; 19; 20; 21; 22; 23; 24; 25; 26; 27; 28; 29; 30; 31; 32; 33; 34; 35; 36; 37; 38; 39; 40; 41; 42; 43; 44; 45; 46; 47; 48; 49; 50; 51; 52; 53; 54; 55; 56; 57; 58; 59; 60; 61; 62; 63; 64; 65; 66; 67; 68; 69; 70; 71; 72 
    • Internet service, cInternetService: No; DSL; Fiber optic
    • Monthly charges, cMonthlyCharges: $0; $10; $20; $30; $40; $50; $60; $70; $80; $90; $100; $110; $120
  • Select any option from each of the controls (it does not matter which you choose).

Step 3: Computing the prediction

Press Insert > R Output (Analysis) and then enter the code below, modifying it as per your needs. For example, with the code SeniorCitizen = cSeniorCitizen, the variable name used in the model is SeniorCitizen and cSeniorCitizen is the name of the control.

The item names in the control must exactly match the values of the variables in the data set. It is for this reason that the MonthlyCharges code is a bit more complicated, as it needs to strip out the $ from the control and convert it into a number (as the variable in the data set just contains numbers).

   
predict(my.gam,
        type = "response",
        newdata = data.frame(SeniorCitizen = cSeniorCitizen, 
                    Tenure = as.numeric(cTenure),
                    InternetService = cInternetService,
                    MonthlyCharges = as.numeric(gsub("\\$", "", cMonthlyCharges))))[1] * 100

Confidence bands

Provided that the predict method supports them, the same approach easily extends to computing confidence intervals and other quantities from models. This code snippet computes the confidence intervals for the GAM used above.

 
pred <- predict(my.gam,
        se.fit = TRUE,
        newdata = data.frame(SeniorCitizen = cSeniorCitizen, 
                    Tenure = as.numeric(cTenure),
                    InternetService = cInternetService,
                    MonthlyCharges = as.numeric(gsub("\\$", "", cMonthlyCharges))))
bounds = plogis(pred$fit + c(-1.96, 0, 1.96) * pred$se.fit) * 100
names(bounds) = c("Lower 95% CI", "Predicted", "Upper 95% CI")
bounds

Computing predictions from coefficients

And, of course, you can also make predictions directly from coefficients, rather than from model objects. For example, the following code makes a prediction for a logistic regression:

 
coefs = my.logistic.regression$coef
XB = coefs["(Intercept)"] + 
        switch(cSeniorCitizen, 
               No = 0, 
               Yes = coefs["SeniorCitizenYes"]) +
        as.numeric(cTenure) * coefs["Tenure"] +
        switch(cInternetService, 
               No =  coefs["InternetServiceNo"], 
               "Fiber optic" = coefs["InternetServiceFiber optic"], 
               DSL = 0) +
        as.numeric(gsub("\\$", "", cMonthlyCharges)) * coefs["MonthlyCharges"]
100 / (1 + exp(-XB))

Making safe predictions

Sometimes models perform "unsafe" transformations of the data in their internals. For example, some machine learning models standardize inputs (subtract the mean and divide by standard deviation). This can create a problem at prediction time, as the predict method may, in the background, attempt to repeat the standardization using the data for the prediction. This will cause an error (as the standard deviation of a single input observation is 0). Similarly, it is possible to create unsafe predictions from even the most well-written model (e.g., if using poly or scale in your model formula). There are a variety of ways of dealing with unsafe predictions, but a safe course of action is to perform any transformations outside of the model (i.e., not in the model formula).
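
A small sketch of the safer approach, with assumed data frame and variable names (telco, churn, MonthlyCharges):

# Standardize outside the model, so predict() never needs to re-standardize
# a single new observation.
charges.mean <- mean(telco$MonthlyCharges)
charges.sd <- sd(telco$MonthlyCharges)
telco$MonthlyChargesZ <- (telco$MonthlyCharges - charges.mean) / charges.sd

safe.fit <- glm(churn ~ MonthlyChargesZ, data = telco, family = binomial)

# At prediction time, apply the training-set mean and sd to the new value.
new.data <- data.frame(MonthlyChargesZ = (80 - charges.mean) / charges.sd)
predict(safe.fit, newdata = new.data, type = "response")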

Step 4: Export the simulator

If everything has gone to plan you can now use the simulator. To export it so that others can use it, click Export > Web Page, and you can then share the link with whoever you wish. The version that I have created here is very simple, but you can do a lot more if you want to make something pretty or more detailed (see the Displayr Dashboard Showcase for more examples).

Explore and edit this simulator

Click here to interact with the published dashboard, or click here to open a copy of the Displayr document that I created when writing this post. It is completely live, so you can interact with it. Click on any of the objects on the page to view the underlying R code, which will appear in the Object Inspector > Properties > R CODE.

Ready to get started? Create your own simulator for free in Displayr!

What is Deep Learning? https://www.displayr.com/what-is-deep-learning/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/what-is-deep-learning/#respond Tue, 14 Aug 2018 13:00:46 +0000 https://www.displayr.com/?p=6416 ...]]> Deep learning can be distinguished from machine learning in general because it learns a hierarchy of structures from the training data. Although other deep techniques exist, the phrase deep learning is used almost exclusively to describe deep neural networks.

Deep learning illustration

Deep learning and image recognition

A common application of deep learning is the recognition of objects in images. The input training data is a set of images, each of which consists of many thousands of pixels. Each pixel of each image is represented by a real number between zero and one. Zero indicates black and one indicates white (assuming, for the sake of simplicity, that the images are greyscale rather than color). The target outcomes are a set of labels such as “cat”, “chair” or “car”.

A deep neural network consists of many layers of neurons. During training, the shallower layers (closer to the input data) learn to identify simple concepts such as edges, corners or circles. Progressing deeper into the network, each successive layer learns increasingly complex concepts, such as eyes or wheels. The final layer learns the target labels.

Through this process, information is distilled and compressed, starting with the raw pixel values and ending with a general concept. This is illustrated above.

As well as images, deep learning can be applied to video, natural language processing (i.e., text) and audio (e.g., speech) data.

Recent growth

Recently, the popularity and application of deep learning has increased significantly. This can be attributed to three main technological factors:

  • The availability of large data sets. The internet has created and enabled the sharing of vast amounts of data from sources like social media, the internet-of-things, and e-commerce. Furthermore, specialist domains such as astronomy, healthcare and finance have their own electronic data, which has been made more accessible by advances in databases and storage.
  • Increasing computation power. The general increase in processing power associated with Moore’s law has benefited all computing. More specifically, deep learning is based upon linear algebra and uses ASIC and GPU chips specifically designed for parallel processing.
  • Advances in learning algorithms. Convolutional and recurrent neural networks are two of the main techniques to utilize deep architectures. New algorithms for gradient descent have improved performance. Overfitting can be mitigated by regularization and dropout.

Want to know more? Check out the rest of our "What is..." guides, or learn more about machine learning! Or, you can make your own neural network by clicking on this link and following the prompts.

How Random Forests Fit to Data https://www.displayr.com/how-random-forests-fit-to-data/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-random-forests-fit-to-data/#respond Mon, 06 Aug 2018 13:00:32 +0000 https://www.displayr.com/?p=6419 ...]]>

large tree

If you are not familiar with random forests, see my earlier article “What is a Random Forest?” before proceeding with this one.  The process of fitting a single decision tree is described in “How is Splitting Decided for Decision Trees?” Random forest trees follow similar steps, with the following differences.

Data sampling

The training data for each tree is created by sampling from the full data set with replacement. This process is illustrated below.

The column on the left contains all the training data. The random samples have the same total number of cases as all the training data. Because cases are chosen randomly, some cases are repeated within a sample, and some cases are not present. Each sample consists of different cases.

tables of training data for random forests

The cases that are not chosen for a given sample are called out-of-bag. Although they will not be used for that specific tree, they may play a role in measuring the accuracy of the forest. They may also be present in the sample for another tree.

Variable sampling

When deciding which variable to split in a forest, only certain variables are considered. If there are p predictor variables, then usually sqrt(p) are randomly chosen for consideration for each split in a classification task. In a regression task, p/3 variables are randomly chosen.
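
A small sketch of these defaults using the randomForest package; the predictor data frame (predictors) and the outcome vectors are assumptions for illustration.

library(randomForest)

p <- ncol(predictors)                       # number of predictor variables

# Classification (factor outcome): mtry defaults to floor(sqrt(p)).
rf.class <- randomForest(x = predictors, y = class.outcome, mtry = floor(sqrt(p)))

# Regression (numeric outcome): mtry defaults to max(floor(p / 3), 1).
rf.reg <- randomForest(x = predictors, y = numeric.outcome, mtry = max(floor(p / 3), 1))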

Given the data and variable sampling, each tree is trained in a process very similar to the way a single decision tree is trained. The data is passed down the tree and at each node the best splitting variable is chosen. The data is partitioned according to the split to form two new nodes. This process repeats until we reach a leaf.

No early stopping

An important difference between training a single tree and a tree within a forest is that for classification tasks, forest trees are usually trained until the leaf nodes contain one sample, or only samples from a single class. By contrast, training of a stand-alone tree usually stops before such leaf purity to avoid overfitting. Stand-alone trees may also use cross-validation and pruning to stop training. Neither are used within a forest. For regression tasks (i.e., predicting a numeric outcome) forest trees usually stop training with leaves containing five or fewer samples.

Prediction

When classifying outputs, the prediction of the forest is the most common prediction of the individual trees. For regression, the forest prediction is the average of the individual trees.

Because forests average many different trees, each of which are built and trained differently, the predictions of the forest exhibit less overfitting than a single tree. The random choices of splitting variable reduce the variance below that achievable through bagging alone, at the cost of a slight increase in bias.

Make your own random forest in Displayr by clicking on this link and following the prompts, or check out our other blogs about data science!

How is Splitting Decided for Decision Trees? https://www.displayr.com/how-is-splitting-decided-for-decision-trees/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-is-splitting-decided-for-decision-trees/#respond Thu, 02 Aug 2018 13:00:14 +0000 https://www.displayr.com/?p=6480 ...]]> If you want to create your own decision tree, you can do so using this decision tree template.

What is a decision tree?

This post builds on the fundamental concepts of decision trees, which are introduced in this post.

Decision trees are trained by passing data down from a root node to leaves. The data is repeatedly split according to predictor variables so that child nodes are more “pure” (i.e., homogeneous) in terms of the outcome variable. This process is illustrated below:

example of decision trees

The root node begins with all the training data. The colored dots indicate classes which will eventually be separated by the decision tree. One of the predictor variables is chosen to make the root split. This creates three child nodes, one of which contains only black cases and is a leaf node. The other two child nodes are then split again to create four more leaves. All the leaves either contain only one class of outcome, or are too small to be split further.

Create your own Decision Tree

At every node, a set of possible split points is identified for every predictor variable. The algorithm calculates the improvement in purity of the data that would be created by each split point of each variable. The split with the greatest improvement is chosen to partition the data and create child nodes.

Choosing the set of split points to test

The set of split points considered for any variable depends upon whether the variable is numeric or categorical. The values of the variable taken by the cases at that node also play a role.

When a predictor is numeric, if all values are unique, there are n – 1 split points for n data points. Because this may be a large number, it is common to consider only split points at certain percentiles of the distribution of values. For example, we may consider every tenth percentile (that is, 10%, 20%, 30%, etc).
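
As a small illustration in R, the candidate split points can be restricted to percentiles of the predictor:

set.seed(1)
x <- rnorm(1000)                                   # a numeric predictor
candidate.splits <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))
candidate.splits                                   # every tenth percentile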

When a predictor is categorical we can decide to split it to create either one child node per class (multiway splits) or only two child nodes (binary split). In the diagram above the Root split is multiway. It is usual to make only binary splits because multiway splits break the data into small subsets too quickly. This causes a bias towards splitting predictors with many classes since they are more likely to produce relatively pure child nodes, which results in overfitting.

If a categorical predictor has only two classes, there is only one possible split. However, if a categorical predictor has more than two classes, various conditions can apply.

If there is a small number of classes, all possible splits into two child nodes can be considered. For example, for classes apple, banana and orange the three splits are:

          Child 1    Child 2
Split 1   apple      banana, orange
Split 2   banana     apple, orange
Split 3   orange     apple, banana

For k classes there are 2^(k – 1) – 1 splits, which is computationally prohibitive if k is a large number.

If there are many classes, they may be ordered according to their average output value. We can then make a binary split into two groups of the ordered classes. This means there are k – 1 possible splits for k classes.

If k is large, there are more splits to consider. As a result, there is a greater chance that one of them will produce a significant improvement and be chosen as the best split. This causes trees to be biased towards splitting variables with many classes over those with fewer classes.

Calculating the improvement for a split

When the outcome is numeric, the relevant improvement is the difference in the sum of squared errors between the node and its child nodes after the split. For any node, the squared error is:

    \[ \sum_{i=1}^{n}{(y_i-c)}^2 \]

where n is the number of cases at that node, c is the average outcome of all cases at that node, and y_i is the outcome value of the i-th case. If all the y_i are close to c, then the error is low. A good clean split will create two nodes which both have all case outcomes close to the average outcome of all cases at that node.

When the outcome is categorical, the split may be based on either the improvement of Gini impurity or cross-entropy:

    \[ Gini\ impurity=\sum_{i=1}^{k}{p_i(1-p_i)} \qquad cross\ entropy=-\sum_{i=1}^{k}{p_i\log(p_i)} \]

where k is the number of classes and p_i is the proportion of cases belonging to class i. These two measures give similar results and are minimal when the probability of class membership is close to zero or one.

Example

For all the above measures, the sum of the measures for the child nodes is weighted according to the number of cases. An example calculation of Gini impurity is shown below:

The initial node contains 10 red and 5 blue cases and has a Gini impurity of 0.444. The child nodes have Gini impurities of 0.219 and 0.490. Their weighted sum is (0.219 * 8 + 0.490 * 7) / 15 = 0.345. Because this is lower than 0.444, the split is an improvement.
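
The arithmetic above can be checked with a few lines of R. The child node counts (7 red and 1 blue; 3 red and 4 blue) are inferred from the impurities quoted in the text rather than taken from the original figure.

gini <- function(counts) {
  p <- counts / sum(counts)
  sum(p * (1 - p))
}

gini(c(10, 5))                                 # parent node: 0.444
child1 <- gini(c(7, 1))                        # 0.219
child2 <- gini(c(3, 4))                        # 0.490
(child1 * 8 + child2 * 7) / 15                 # weighted sum: 0.345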

One challenge for this type of splitting is known as the XOR problem. When no single split increases the purity, then early stopping may halt the tree prematurely. This is the situation for the following data set:

You can make your own decision trees in Displayr by using the template below.

What is a Decision Tree? https://www.displayr.com/what-is-a-decision-tree/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/what-is-a-decision-tree/#respond Wed, 01 Aug 2018 13:00:27 +0000 https://www.displayr.com/?p=6490 ...]]> To create your own decision tree, use the template below.

The decision tree is typically read from top (root) to bottom (leaves). A question is asked at each node (split point) and the response to that question determines which branch is followed next. The prediction is given by the label of a leaf.

The diagram below shows a decision tree which predicts how to make the journey to work.

decision tree graphic

The first question asked is about the weather. If it’s cloudy, then the second question asks whether I am hungry. If I am, then I walk, so I can go past the café. However, if it’s sunny then my mode of transport depends on how much time I have.

The responses to questions and the prediction may be either:

  • Binary, meaning the response is yes/no or true/false as per the hungry question above
  • Categorical, meaning the response is one of a defined number of possibilities, e.g. the weather question
  • Numeric, an example being the time question

Create your own Decision Tree

How a decision tree is created

The small example above represents a series of rules such as “If it’s raining, I take the bus.” If the rules are known in advance, the tree could be built manually.

In real-world examples, we often don’t have rules, but instead have examples. The examples are in the form of a data set of instances or observations. Each instance consists of several predictor variables and a single outcome. The predictor variables are the questions and the outcome is the prediction. An example of such data is shown in the table below.

Outcome   Weather   Hungry   Time
Bus       Rain      No       >30 mins
Walk      Cloud     Yes      <30 mins
Walk      Sun       No       >30 mins
Bus       Cloud     No       >30 mins
Bus       Sun       Yes      <30 mins

Given this data, the general framework for building a decision tree is as follows:

  1. Set the first node to be the root, which considers the whole data set.
  2. Select the best variable to split at this node.
  3. Create a child node for each split value of the selected variable.
  4. For each child, consider only the data with the split value of the selected variable.
  5. If the examples are perfectly classified then stop. The node is a leaf.
  6. Otherwise repeat from step 2 for each child node until a leaf is reached.
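
As a minimal sketch, this framework can be run with the rpart package on the toy commute data from the table above. With only five rows the resulting tree is trivial, but the same call scales to real data sets.

library(rpart)

commute <- data.frame(
  Outcome = c("Bus", "Walk", "Walk", "Bus", "Bus"),
  Weather = c("Rain", "Cloud", "Sun", "Cloud", "Sun"),
  Hungry  = c("No", "Yes", "No", "No", "Yes"),
  Time    = c(">30 mins", "<30 mins", ">30 mins", ">30 mins", "<30 mins"),
  stringsAsFactors = TRUE
)

fit <- rpart(Outcome ~ Weather + Hungry + Time,
             data = commute,
             method = "class",
             control = rpart.control(minsplit = 2, cp = 0))
fit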

This outline is followed by popular tree-building algorithms such as CART, C4.5 and ID3.

This is a greedy algorithm, meaning that at each node it uses local information to find the best split for that node. An implication is that it may be possible to create a better tree by changing the order of the splitting variables.

Trees have a high degree of flexibility in the relationships that they can learn, which is known as having low bias. The downside of this is that they can learn the noise in the data, known as high variance. High variance often leads to overfitting, whereby the tree makes over-confident predictions.

Create your own Decision Tree

Advantages of decision trees

There are several reasons to consider decision trees, including:

  • The tree output is easy to read and interpret
  • They are able to handle non-linear numeric and categorical predictors and outcomes
  • Decision trees can be used as a baseline benchmark for other predictive techniques
  • They can be used as a building block for sophisticated machine learning algorithms such as random forests and gradient-boosted trees

Disadvantages of decision trees

  • Trees are subject to overfitting, due to the data being repeatedly split into smaller subsets. However, overfitting may be mitigated by pruning and early-stopping.
  • The greedy algorithm causes a bias towards the best splitting variables being closest to the root.
  • For complex problems, the performance may be inferior to that of more sophisticated predictive techniques.

For more information on decision trees, see “How is Splitting Decided for Decision Trees?” and "Pruning Decision Trees".

You can create your own decision trees in Displayr by clicking on the template below.

How is Variable Importance Calculated for a Random Forest? https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-is-variable-importance-calculated-for-a-random-forest/#respond Mon, 30 Jul 2018 13:59:44 +0000 https://www.displayr.com/?p=6273

This post builds on my earlier description of random forests. We recommend reading that post first for context.

The example below shows the importance of eight variables when predicting an outcome with two options. In this instance, the outcome is whether a person has an income above or below $50,000.

There are two measures of importance given for each variable in the random forest. The first measure is based on how much the accuracy decreases when the variable is excluded. This is further broken down by outcome class. The second measure is based on the decrease of Gini impurity when a variable is chosen to split a node. See this article for more information on Gini.
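
A sketch of how both measures are produced with the randomForest package is below. The data frame (adult) and outcome (income) names are assumptions for illustration and are not necessarily those used in the post.

library(randomForest)

rf <- randomForest(income ~ ., data = adult, importance = TRUE)

importance(rf)   # per-class and overall MeanDecreaseAccuracy, plus MeanDecreaseGini
varImpPlot(rf)   # plots the two measures side by side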

 

Train your own random forest

 

Accuracy-based importance

Each tree has its own out-of-bag sample of data that was not used during construction. This sample is used to calculate importance of a specific variable. First, the prediction accuracy on the out-of-bag sample is measured. Then, the values of the variable in the out-of-bag-sample are randomly shuffled, keeping all other variables the same. Finally, the decrease in prediction accuracy on the shuffled data is measured.

The mean decrease in accuracy across all trees is reported. This importance measure is also broken down by outcome class. For example, age is important for predicting that a person earns over $50,000, but not important for predicting a person earns less.

Intuitively, the random shuffling means that, on average, the shuffled variable has no predictive power. This importance therefore measures how much removing a variable decreases accuracy and, conversely, how much including it increases accuracy.

Note that if a variable has very little predictive power, shuffling may lead to a slight increase in accuracy due to random noise. This in turn can give rise to small negative importance scores, which can be essentially regarded as equivalent to zero importance.

Train your own random forest

 

Gini-based importance

When a tree is built, the decision about which variable to split at each node uses a calculation of the Gini impurity.

For each variable, the sum of the Gini decrease across every tree of the forest is accumulated every time that variable is chosen to split a node. The sum is divided by the number of trees in the forest to give an average. The scale is irrelevant: only the relative values matter. In the example above, occupation is over five times more important than country.

The importances are roughly aligned between the two measures, with numeric variables age and hrs_per_week being lower on the Gini scale. This may indicate a bias towards using numeric variables to split nodes because there are potentially many split points.

Importance for numeric outcomes

The previous example used a categorical outcome. For a numeric outcome (as shown below) there are two similar measures:

  • Percentage increase in mean square error is analogous to accuracy-based importance, and is calculated by shuffling the values of the out-of-bag samples.
  • Increase in node purity is analogous to Gini-based importance, and is calculated based on the reduction in sum of squared errors whenever a variable is chosen to split.

 

Train your own random forest

 

Summary

One advantage of the Gini-based importance is that the Gini calculations are already performed during training, so minimal extra computation is required. A disadvantage is that splits are biased towards variables with many classes, which also biases the importance measure. Both methods may overstate the importance of correlated predictors.

Neither measure is perfect, but viewing both together allows a comparison of the importance ranking of all variables across both measures. For further reading, see this paper and these slides.

This analysis was done in Displayr. To see Displayr in action, get started below.

Sign up for free

Four Ways to Write Better Stan Code https://www.displayr.com/how-to-write-efficient-stan-code/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/how-to-write-efficient-stan-code/#respond Wed, 11 Jul 2018 23:00:20 +0000 https://www.displayr.com/?p=5427 ...]]>

1. Improve sampler efficiency by picking the right model

We need to address how we specify our models before even discussing writing code that is computationally optimized. Make sure that your model is parameterized in such a way that Stan can easily sample it with its algorithms - the No-U-Turn Sampler (NUTS) or Hamiltonian Monte Carlo (HMC). Simply selecting the right model can make a big difference in terms of efficiency. Due to differences in implementations and algorithms, an efficient parameterization in Stan is not necessarily the one that was best in other software we've tried.

A poorly specified model will require more samples to reach convergence and adequately explore the posterior distribution, or it may not converge at all. For some models, reparameterization can be an effective means of improving sampling efficiency, replacing a distribution Stan has difficulty sampling from with one from which it can draw more easily. One example discussed in Section 28.6 of the Stan manual involves reparameterizing the Cauchy distribution, which is a challenge for Stan to sample from because of its heavy tails. The difficulties can be fixed by instead sampling a uniform random variable and applying the probability integral transform.

2. Matrices or arrays, pick carefully!

It can also be confusing whether to use matrices or arrays when writing your code. There are actually four different ways to specify a two-dimensional collection of real numbers! But which one should you pick? This largely depends on the operations you need to perform. If you need to do matrix computations, you should be using a matrix. However, if you frequently need to index into the rows of the matrix it is more efficient to use arrays. In this situation, it will save you a headache to declare an array of type row_vector than to work with matrices.

Matrices and arrays should also be traversed in different orders. Loops over arrays should have the last index varying fastest, whereas the opposite is true for matrices. Additionally, traversing matrices is slightly more efficient. If, for example, your code involves an I x J array of matrices, each of dimension R x C, then the most efficient way to write a loop that traverses every element is:

matrix[R,C] a[I,J];
for (i in 1:I)
  for (j in 1:J)
    for (c in 1:C)
      for (r in 1:R)
        a[i,j,r,c] = ......

3. Let built-in functions and vectorization save you time

Stan has a number of optimized built-in functions for common computations such as dot products and special matrix multiplications. You should use these whenever possible to save yourself from having to write your own code to perform the calculations. There are also functions that will improve the speed by vectorizing code. For example, this loop

matrix[R,C] m;
for(r in 1:R)
  m[r] ~ normal(0, 1);

should be replaced with the faster, vectorized code:

matrix[R,C] m;
to_vector(m) ~ normal(0,1);

4. Customize with compiler optimizations

Because Stan relies on compiled C++ code, it may be possible for advanced users to further optimize by changing compiler flags in R's version of a Makefile, known as a Makevars file. It is recommended to do this in a user-specific file located at ~/.R/Makevars. Be careful though - using overly aggressive compiler options can result in code that is not portable across machines and architectures. The trade-off, however, is code that is as fast as possible on your own computer. The RStan installation guide recommends adding CXXFLAGS=-O3 to the Makevars file for the highest level of overall optimization possible, but be warned - this can result in increased memory usage and larger binary file sizes!

We hope these four tips help you improve the efficiency of your coded models! Check out more tips and tricks here. 

What is a ROC Curve and How to Interpret It https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/what-is-a-roc-curve-how-to-interpret-it/#respond Thu, 05 Jul 2018 05:25:06 +0000 https://www.displayr.com/?p=5402

A Receiver Operating Characteristic (ROC) curve is a graphical plot used to show the diagnostic ability of binary classifiers. It was first used in signal detection theory but is now used in many other areas such as medicine, radiology, natural hazards and machine learning. In this post I'll show you how a ROC curve is created and how to interpret it.

An example is shown below:

ROC curve

Create your own ROC curve

 

Creating a ROC curve

A ROC curve is constructed by plotting the true positive rate (TPR) against the false positive rate (FPR). The true positive rate is the proportion of observations that were correctly predicted to be positive out of all positive observations (TP/(TP + FN)). Similarly, the false positive rate is the proportion of observations that are incorrectly predicted to be positive out of all negative observations (FP/(TN + FP)). For example, in medical testing, the true positive rate is the rate in which people are correctly identified to test positive for the disease in question.

A discrete classifier that returns only the predicted class gives a single point on the ROC space. But for probabilistic classifiers, which give a probability or score that reflects the degree to which an instance belongs to one class rather than another, we can create a curve by varying the threshold for the score. Note that many discrete classifiers can be converted to a scoring classifier by ‘looking inside’ their instance statistics. For example, a decision tree determines the class of a leaf node from the proportion of instances at the node.
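
The construction can be sketched in a few lines of R with simulated labels and scores (real classifier output would be used in practice):

set.seed(1)
labels <- rbinom(200, 1, 0.3)                                 # 1 = positive class
scores <- ifelse(labels == 1, rnorm(200, 1), rnorm(200, 0))   # classifier scores

# Sweep the threshold from high to low, computing TPR and FPR at each step.
thresholds <- c(Inf, sort(unique(scores), decreasing = TRUE))
tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)                                         # random-classifier baseline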

Create your own ROC curve

 

Interpreting the ROC curve

ROC curve

The ROC curve shows the trade-off between sensitivity (or TPR) and specificity (1 – FPR). Classifiers that give curves closer to the top-left corner indicate a better performance. As a baseline, a random classifier is expected to give points lying along the diagonal (FPR = TPR). The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

Note that the ROC does not depend on the class distribution. This makes it useful for evaluating classifiers predicting rare events such as diseases or disasters. In contrast, evaluating performance using accuracy, (TP + TN)/(TP + TN + FN + FP), would favor classifiers that always predict a negative outcome for rare events.

Create your own ROC curve

 

Area under curve (AUC)

To compare different classifiers, it can be useful to summarize the performance of each classifier into a single measure. One common approach is to calculate the area under the ROC curve, which is abbreviated to AUC. It is equivalent to the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance, i.e. it is equivalent to the two sample Wilcoxon rank-sum statistic.

A classifier with a high AUC can occasionally score worse in a specific region than another classifier with a lower AUC. But in practice, the AUC performs well as a general measure of predictive accuracy.

Create your own ROC curve in Displayr. Sign up below to get started.

Sign up for free

Predict Customer Churn with Gradient Boosting https://www.displayr.com/predict-customer-churn-gradient-boosting/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/predict-customer-churn-gradient-boosting/#respond Mon, 02 Jul 2018 19:30:03 +0000 https://www.displayr.com/?p=1756 ...]]>

Why predict customer churn?

Customer retention should be a top priority of any business, as acquiring new customers is often far more expensive than keeping existing ones. It is no longer a given that long-standing customers will remain loyal, given the numerous options in the market. Therefore, it is vital that companies can proactively identify the customers most at risk of leaving and take preventative measures.

Predictive models for customer churn can show the overall rate of attrition, while knowledge of how the churn rate varies over time, across customer cohorts, product lines and other segments can provide numerous valuable insights. Yet customers also vary enormously in their behaviors and preferences, which means that applying a simple "rule of thumb" analysis will not work. Here's where a predictive model using gradient boosting can help you.

First, I'm going to describe the data. Then I'll use gradient boosting to predict who will churn and who will stay. Finally I'll benchmark my result against other models.

Try Gradient Boosting

The data

I'll aim to predict Churn, a binary variable indicating whether a customer of a telecoms company left in the last month or not.

To do this I'll use 19 variables including:

  • Length of tenure in months.
  • Types of services signed up for such as phone, internet and movie streaming.
  • Demographic information.
  • Monthly charges, type of contract and billing.

The full data set is available here.

The breakdown of Churn is shown below. If we predict No (a customer will not churn) for every case, we can establish a baseline. Our baseline establishes that 73% is the minimum accuracy that we should improve on.

Gradient boosting

Try Gradient Boosting

In this earlier post I explained gradient boosting. Gradient boosting sits alongside regression, decision trees, support vector machines and random forests. They are all supervised learning algorithms capable of fitting a model to train data and make predictions.

A common strategy when working with any of these models is to split the data into a training sample and a testing sample. The model learns the associations between the predictor variables and the target outcome from the training sample. The testing sample is used to provide an unbiased estimate of the prediction accuracy on unseen data.

I randomly split the data into a 70% training sample and a 30% testing sample. I then perform gradient boosting with an underlying tree model. The chart below shows the 10 most important variables. We learn that having a monthly contract, length of tenure and amount of charges are useful predictors of churn.

How accurately can we predict customer churn with gradient boosting?

Gradient boosting has various internal parameters known generically as hyper-parameters. These settings determine the size of the underlying trees and the impact that each round of boosting has on the overall model. It can be time-consuming to explore all of the possibilities to find the best values. To create the model below I automatically performed a grid search of 36 different combinations of hyper-parameters. I selected the best set by 5-fold cross-validation.
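
The models in this post are fitted with Displayr's flipMultivariates package (built on xgboost), but a roughly equivalent grid search can be sketched with the caret package. The data frame name (training.sample) and the particular grid below are assumptions for illustration; the grid happens to contain 36 combinations but is not necessarily the one used here.

library(caret)

set.seed(123)
grid <- expand.grid(nrounds = c(50, 100, 150),
                    max_depth = c(2, 4, 6),
                    eta = c(0.1, 0.3),
                    gamma = 0,
                    colsample_bytree = 1,
                    min_child_weight = 1,
                    subsample = c(0.75, 1))

fit <- train(Churn ~ ., data = training.sample,
             method = "xgbTree",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = grid)

fit$bestTune   # the best hyper-parameter combination found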

We've already established that our baseline accuracy when always predicting that a customer will not churn is 73%. This amounts to a shot in the dark when trying to determine whether or not a customer will churn. Not great, right? However, when we input more information, such as a person's bills, their contract length or tenure, and their comfort level with the technology, we can learn much more about each customer.

From this information, we can more accurately pinpoint who will churn, and our prediction accuracy rises by about 8 percentage points to 80.87%. This gives us a much greater edge in identifying the factors that may lead to customer attrition, and in the crucial business of customer retention!

Try Gradient Boosting

Why choose gradient boosting over other models?

In the same way that I just fitted a gradient boosting model, we can fit other models. I tried 3 other approaches. Each time I followed the same procedure as above, selecting the same variables, fitting with the training sample and calculating accuracy from the testing sample. The results are:

Model                          Accuracy
Gradient Boosted Tree          80.87%
CART                           79.21%
Random Forest                  79.94%
Linear Discriminant Analysis   79.97%

Whilst this is not a comprehensive comparison, gradient boosting performs the best amongst these models with the highest accuracy score.


TRY IT OUT
The analysis in this post was performed in Displayr using R. The flipMultivariates package, which uses the xgboost package, performs the machine learning calculations. You can try this analysis for yourself in Displayr.


 

What is a Random Forest? https://www.displayr.com/what-is-a-random-forest/?utm_medium=Feed&utm_source=Syndication https://www.displayr.com/what-is-a-random-forest/#respond Mon, 07 May 2018 06:35:51 +0000 https://www.displayr.com/?p=5050 ...]]> A random forest is an ensemble of decision trees. Like other machine-learning techniques, random forests use training data to learn to make predictions.

One of the drawbacks of learning with a single tree is the problem of overfitting. Single trees tend to learn the training data too well, resulting in poor prediction performance on unseen data. This is also known as variance and results in a model that is sensitive to small changes in the training data. Although various techniques (pruning, early stopping and minimum split size) can mitigate tree overfitting, random forests take a different approach.

Random forests use a variation of bagging whereby many independent trees are learned from the same training data. A forest typically contains several hundred trees.

Comparison to single decision trees

There are three main areas that differentiate the training of random forests from single trees:

  • The training data for each tree is created by sampling from the full data set with replacement.
  • Only a subset of variables is considered when deciding how to split each node.
  • Random forest trees are trained until the leaf nodes contain one or very few samples.

When classifying outputs, the prediction of the forest is the most common prediction of the individual trees. For regression, the forest prediction is the average of the individual trees.

Disadvantages of random forests

  • Although random forests can be an improvement on single decision trees, more sophisticated techniques are available. Prediction accuracy on complex problems is usually inferior to gradient-boosted trees.
  • A forest is less interpretable than a single decision tree. Single trees may be visualized as a sequence of decisions.
  • A trained forest may require significant memory for storage, due to the need for retaining the information from several hundred individual trees.

Advantages of random forests

  • Works well “out of the box” without tuning any parameters. Other models may have settings that require significant experimentation to find the best values.
  • Tend not to overfit. The processes of randomizing the data and variables across many trees means that no single tree sees all the data. This helps to focus on the general patterns within the training data and reduce sensitivity to noise.
  • Ability to handle non-linear numeric and categorical predictors and outcomes. Other models may require numeric inputs or assume linearity.
  • Accuracy calculated from out-of-bag samples is a proxy for using a separate test data set. The out-of-bag samples are those not used for training a specific tree and as such can be used as an unbiased measure of performance.
  • Predictor variable importance can be calculated. For more information, see “How is Variable Importance Calculated for Random Forests?”

TRY IT OUT
You can build a random forest for yourself in Displayr, just follow the instructions.

