knitr::opts_chunk$set(echo = TRUE, warning = FALSE, verbose = FALSE, message = FALSE)

# Clear any existing variables
rm(list = ls())

# Load all packages needed to execute the job
# If the packages are not installed, write
# install.packages("<name of package>")

library(ggplot2)    # plot
library(glmnet)     # lasso
library(grf)        # generalized random forests
library(sandwich)   # for robust CIs
library(tableone)   # compute standardized mean differences (SMDs)
library(cobalt)     # plot SMDs
library(FactoMineR) # visualization and PCA methods; see the FactoMineR website and YouTube channel
library(Matching)   # for IPW estimates with valid std errors
library(survey)     # for robust standard errors

Problem and data description

This dataset is simulated, and the variable names and relationships do not necessarily represent a realistic scenario. However, to make the results easier to comment on and interpret, we use plausible variable names. We will assume that our goal is to study the impact of a nutrition change on the development of Type 2 diabetes after a prediabetes diagnosis.

The source of the data informs us that the true effect in this population is -0.09. (Since these data are synthetically generated, it is indeed possible to have this information. Another possibility, in a real-world setting, would be to conduct a randomized controlled trial on the population of interest.)

In this tutorial, we will use the following variables.

  • Response variable

  • diabetes_5y: Indicator variable where \(= 1\) indicates development of type 2 diabetes by year 5 after treatment

  • Treatment variable

  • nutrition_control: Indicator variable where \(= 1\) indicates Nutrition change treatment

  • Covariates

  • sex: Indicator variable where \(= 1\) indicates male

  • diabetes_family: Indicator variable where \(= 1\) indicates presence of type 2 diabetes in the family

  • obese_family: Indicator variable where \(= 1\) indicates presence of obesity (BMI>30) in the family

  • alcohol_consumption: Indicator variable where \(= 1\) indicates regular consumption of alcohol

  • cardio_vascular_disease: Indicator variable where \(= 1\) indicates presence of cardiovascular disease

  • sedentary_work: Indicator variable where \(= 1\) indicates a sedentary work and lifestyle

  • age: Age at the start of treatment

  • number_md_5y: Number of referring physicians over 5 years

  • bmi: Body Mass Index at the start of treatment

  • mbp: Mean blood pressure at the start of treatment (average of systolic and diastolic blood pressure)

Below, we load the data as a .csv file, and rename the response and the treatment variable to \(Y\) and \(W\), respectively.

# Clear any existing variables
rm(list = ls())

# Set seed for reproducibility (the specific seed value is arbitrary)
set.seed(1)
# Load data
data_raw <- read.csv('diabetes_nutrition.csv')

# These are the covariates we'll use
variables_names <- c("number_md_5y", "bmi", "mbp", "age", "hospital")
binary_variables_names <- c("sex",
                            "diabetes_family", "obese_family", 
                            "alcohol_consumption", "cardio_vascular_disease", "sedentary_work")
covariates <- c(variables_names, binary_variables_names)
all_variables_names <- c(covariates, "nutrition_control", "diabetes_5y")

# Extracting indicator variables
binary_covariates <- data_raw[,binary_variables_names]

# Extracting outcome and treatment
outcome <- data_raw$diabetes_5y
treatment <- data_raw$nutrition_control

covariates <- data_raw[ ,variables_names]

# Setting up the data, renaming columns
df <- data.frame(covariates, binary_covariates, W=treatment, Y=outcome)

df$hospital <- as.factor(df$hospital)

Recall: Average Treatment Effect (ATE)

Let us briefly formalize our goal. We observe a sequence of triples \(\{(W_i, Y_i, X_i)\}_{i=1}^{N}\), where \(W_i\) represents whether subject \(i\) was “treated” with a nutrition plan, \(Y_i\) is a binary variable representing whether they developed diabetes within 5 years, and \(X_i\) is a vector of other observable characteristics. Moreover, in the Neyman-Rubin potential-outcomes framework, we will denote by \(Y_i(1)\) the random variable that represents the potential outcome of subject \(i\) had they received the treatment, and by \(Y_i(0)\) the same potential outcome had they not received it. The individual treatment effect for subject \(i\) can then be written as

\[Y_i(1) - Y_i(0)\]

Unfortunately, in our data we can only observe one of these two potential outcomes. Computing this difference for each individual is impossible. But we will try to use the information we have about the distribution of the data to say something about its average, called the average treatment effect (ATE) and denoted here by \(\tau\):

\[\tau := E[Y_i(1) - Y_i(0)]\]

Now, what method will work here depends on our assumptions about the data-generating process. In this tutorial, we will always assume for simplicity that the data is independently and identically distributed (iid).

In an experimental setting, we would also assume that the potential outcomes are independent of the treatment:

\[Y_i(1), Y_i(0) \ \perp \ W_i\]

In plain English, this would mean assuming that whether or not a subject was advised a Nutrition change has nothing to do with how they would respond to this “treatment”.

Since we have the information that the data is confounded, this assumption is violated. Indeed, it is likely that people who are more aware of the risk of type 2 diabetes, and therefore more open to changing their lifestyle, are more likely to receive the treatment.

The independence assumption above would allow us to produce a simple estimator for the ATE:

\[\begin{align} \tau &= E[\tau_{i}] \\ &= E[Y_i(1) - Y_i(0)] \\ &= E[Y_i(1)] - E[Y_i(0)] \qquad \qquad \because \text{Linearity of expectation}\\ &= E[Y_i(1)|W_i = 1] - E[Y_i(0)|W_i = 0] \qquad \because \text{Independence assumption} \\ &= E[Y_i|W_i = 1] - E[Y_i|W_i = 0] \qquad \because \text{Consistency} \\ \end{align}\]

In words, the math above states that to estimate the average treatment effect we would only need to know the diabetes prevalence among treated and among control subjects and compute their difference.

The implied estimator would be:

\[\hat{\tau} = \frac{1}{n_1}\sum_{i | W_i = 1} Y_i - \frac{1}{n_0}\sum_{i | W_i = 0} Y_{i}\]

where \(n_1\) and \(n_0\) are the numbers of subjects in the treatment and control groups, respectively. Under the above independence assumption, the following snippet would consistently estimate the average treatment effect and also give its confidence interval.

difference_in_means <- function(dataset) {
  treated_idx <- which(dataset$W == 1)
  control_idx <- which(dataset$W == 0)
  # Filter treatment / control observations, pull the outcome variable as a vector
  y1 <- dataset[treated_idx, "Y"] # Outcome in treatment group
  y0 <- dataset[control_idx, "Y"] # Outcome in control group
  n1 <- sum(dataset[, "W"])     # Number of obs in treatment
  n0 <- sum(1 - dataset[, "W"]) # Number of obs in control
  # The difference in means estimates the ATE
  tauhat <- mean(y1) - mean(y0)
  # 95% confidence interval
  se_hat <- sqrt(var(y1) / n1 + var(y0) / n0)
  lower_ci <- tauhat - 1.96 * se_hat
  upper_ci <- tauhat + 1.96 * se_hat
  return(c(ATE = tauhat, lower_ci = lower_ci, upper_ci = upper_ci))
}

tauhat_naive <- difference_in_means(df)

Look at the output:

##         ATE    lower_ci    upper_ci 
## -0.00597090 -0.02719679  0.01525499

Assessing treatment bias in the data

We have the information that the true ATE is -0.09, but from the naive estimation we would conclude that the treatment has no effect. This is due to the treatment bias in the data, i.e., the violation of the independence assumption recalled above.
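To see concretely how confounding hides a real effect, here is a small simulation on hypothetical data (all variable names and parameter values below are invented for illustration, with a true ATE of -0.1): the same difference in means is computed once under random assignment and once under confounded assignment, where a risk factor drives both treatment uptake and the outcome.

```r
set.seed(123)
n <- 100000
x <- rbinom(n, 1, 0.5)                # hypothetical confounder (e.g. risk awareness)
y0 <- rbinom(n, 1, 0.4 + 0.3 * x)     # potential outcome without treatment
y1 <- rbinom(n, 1, 0.3 + 0.3 * x)     # potential outcome with treatment: true ATE = -0.1

# Randomized assignment: W independent of (Y(0), Y(1))
w_rand <- rbinom(n, 1, 0.5)
y_rand <- ifelse(w_rand == 1, y1, y0)
tau_rand <- mean(y_rand[w_rand == 1]) - mean(y_rand[w_rand == 0])

# Confounded assignment: at-risk subjects (x = 1) are treated more often
w_conf <- rbinom(n, 1, 0.2 + 0.6 * x)
y_conf <- ifelse(w_conf == 1, y1, y0)
tau_conf <- mean(y_conf[w_conf == 1]) - mean(y_conf[w_conf == 0])

round(c(randomized = tau_rand, confounded = tau_conf), 3)
```

Under randomization, the difference in means lands near the true -0.1; under confounding, it is pulled toward zero and beyond, mirroring what we observe in the tutorial data.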

A look at the contingency table of treatment and outcome can give a first hint.

  1. Using the functions table and prop.table (with the option margin if needed), comment on the difference between the old and new distributions. Estimate, in the new data, P(Treatment), P(Diabetes | Treatment) and P(Diabetes | Control), and comment.
res.table <- table(df$W == 1, df$Y == 1,
                   dnn = c("Treated", "Diabetes"))
prop.table(res.table)
##        Diabetes
## Treated      FALSE       TRUE
##   FALSE 0.23771252 0.50056672
##   TRUE  0.08583205 0.17588872

You can also use the option margin in prop.table.

prop.table(res.table, margin=1)
##        Diabetes
## Treated     FALSE      TRUE
##   FALSE 0.3219819 0.6780181
##   TRUE  0.3279528 0.6720472

Here we retrieve the naive treatment effect \(\bar Y_1 - \bar Y_0 = 0.6720472 - 0.6780181 \approx -0.006\). When interpreting a contingency table, if the percentages sum to 1 within rows, you compare the rows (reading down each column), and vice versa.

From the above table we can also deduce the following estimates for the population characteristics:

Diabetes prevalence: 67.6 %

Estimated P(Treatment): 0.26

P(Diabetes | Treatment): 0.67

P(Diabetes | Control): 0.68

These quantities suggest that the treatment has no impact, or even a harmful one, on the chances of developing type 2 diabetes.

We know, however, from the person who generated the data that this is not the case and that the treatment has a beneficial impact.
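The estimates above can be read straight off prop.table; here is a toy sketch (hypothetical vectors, not the tutorial data) showing which call yields which probability:

```r
w <- c(0, 0, 0, 1, 1, 0, 1, 0)   # hypothetical treatment indicator
y <- c(1, 0, 1, 1, 0, 1, 0, 0)   # hypothetical outcome

tab <- table(Treated = w == 1, Diabetes = y == 1)

prop.table(tab)              # joint probabilities P(W, Y); all cells sum to 1
rowSums(prop.table(tab))     # marginals: P(Control) and P(Treatment)
prop.table(tab, margin = 1)  # rows sum to 1: P(Diabetes | Control), P(Diabetes | Treatment)
```

The margin = 1 table conditions on the rows (treatment status), which is exactly what P(Diabetes | Treatment) and P(Diabetes | Control) require.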

A way to assess the imbalance between the treatment and control groups is to compute and visualize the standardized mean differences (SMDs) of the covariates between the treated and the controls.

In the medical community, this is often done using the packages tableone and cobalt:

xvars <- colnames(df[, sapply(df, is.numeric)])
xvars <- xvars[!xvars %in% c("W", "Y")]
tab <- CreateTableOne(vars = xvars, strata = "W", data = df[, sapply(df, is.numeric)], test = FALSE)
print(tab, smd = TRUE)
##                                      Stratified by W
##                                       0             1             SMD   
##   n                                    7165          2540               
##   number_md_5y (mean (SD))             1.88 (0.72)   1.98 (0.75)   0.132
##   bmi (mean (SD))                     31.92 (3.58)  31.97 (3.60)   0.014
##   mbp (mean (SD))                     35.77 (10.98) 36.52 (11.28)  0.068
##   age (mean (SD))                     47.76 (13.70) 46.26 (15.04)  0.104
##   sex (mean (SD))                      0.42 (0.49)   0.76 (0.43)   0.746
##   diabetes_family (mean (SD))          0.88 (0.33)   0.84 (0.37)   0.108
##   obese_family (mean (SD))             0.86 (0.34)   0.75 (0.43)   0.289
##   alcohol_consumption (mean (SD))      0.27 (0.44)   0.25 (0.43)   0.043
##   cardio_vascular_disease (mean (SD))  0.41 (0.49)   0.40 (0.49)   0.015
##   sedentary_work (mean (SD))           0.52 (0.50)   0.13 (0.34)   0.897
balance <-[, xvars], treat = df$W, estimand = "ATE", continuous = "std")
love.plot(x = balance, stat = "mean.diffs", abs = TRUE, var.order = "unadjusted", threshold = 0.1, cex=0.8)

This plot displays the standardized mean differences (here, before any adjustment or weighting), i.e., the difference in means between the two groups divided by the pooled standard deviation. For a variable \(X_j\), this corresponds to:

\[SMD(X_j) = \frac{\overline{X_j^{(1)}} - \overline{X_j^{(0)}}}{\sqrt{\frac{(s^{(1)})^{2}+(s^{(0)})^{2}}{2}}}\] This SMD can be interpreted as a difference in means in standard deviation units.

Note that for binary variables, the SMD is defined via sample probabilities: if variable \(X_k\) is binary (assume \(X_k\in\{0,1\}\)), denote by \(p_k^{(w)}\) the probability \(Pr(X_{ik} = 1)\) in treatment group \(w\). Then \[SMD(X_k) = \frac{\widehat{p_k}^{(1)} - \widehat{p_k}^{(0)}}{\sqrt{\frac{\widehat{p_k}^{(1)}(1-\widehat{p_k}^{(1)})+\widehat{p_k}^{(0)}(1-\widehat{p_k}^{(0)})}{2}}} \]
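The two formulas can be coded directly; the sketch below (simulated, hypothetical data; the helper names smd and smd_binary are our own) can serve as a sanity check against what CreateTableOne reports.

```r
# Direct implementation of the SMD formulas above
smd <- function(x, w) {
  m1 <- mean(x[w == 1]); m0 <- mean(x[w == 0])
  s1 <- var(x[w == 1]);  s0 <- var(x[w == 0])
  (m1 - m0) / sqrt((s1 + s0) / 2)   # pooled-SD standardization
}
smd_binary <- function(x, w) {
  p1 <- mean(x[w == 1]); p0 <- mean(x[w == 0])
  (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
}

set.seed(1)
w <- rbinom(500, 1, 0.5)
x_cont <- rnorm(500, mean = 30 + 2 * w, sd = 4)  # mean shifted by 0.5 SD in the treated
x_bin  <- rbinom(500, 1, 0.3 + 0.2 * w)          # prevalence differs between groups

round(c(continuous = smd(x_cont, w), binary = smd_binary(x_bin, w)), 3)
```

For the continuous covariate, the population SMD is 2/4 = 0.5 by construction, and the sample estimate lands nearby.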

Standardized mean differences (SMDs) larger than some threshold th, for instance th = 0.1, indicate that the covariate distributions in the two groups differ: the treatment is not assigned uniformly at random. This motivates some adjustment or balancing before performing a causal analysis of the treatment effect on the outcome.

Unsurprisingly, the two groups differ on the set of variables that we used earlier to identify likely and unlikely diabetes patients.

Similarly, you can use the function catdes from the FactoMineR package, which is dedicated to describing a categorical variable by the other variables. This function is often used in clustering to quickly describe clusters.

df_Q <- df
df_Q[, "W"] <- as.factor(df_Q[, "W"])
indW <- which(colnames(df_Q) == "W")
res.catdes <- catdes(df_Q, num.var = indW, proba = 0.05)
plot(res.catdes, level = 0.05)

For more information on catdes, you can look at the supplementary files. In addition, you can use the FactoShiny package for an interactive interface.


To be able to estimate the ATE from observational data, we need additional assumptions.

The first one is known as unconfoundedness, and it is formalized as a conditional independence assumption as follows.

\[Y_i(1), Y_i(0) \perp W_i \ | \ X_i\]

Unconfoundedness implies that the treatment is randomly assigned within each subpopulation indexed by \(X_i = x\). Alternatively, it means that once we know all observable characteristics of individual \(i\), knowing their treatment status gives us no extra information about their potential outcomes.
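Unconfoundedness is exactly what makes adjustment possible: within each stratum of \(X\), treatment is as good as randomized, so a stratified difference in means recovers the ATE even when the naive one fails. A sketch on simulated, hypothetical data (parameter values invented; true ATE set to -0.1):

```r
set.seed(42)
n <- 100000
x <- rbinom(n, 1, 0.5)                # observed confounder
y0 <- rbinom(n, 1, 0.4 + 0.3 * x)
y1 <- rbinom(n, 1, 0.3 + 0.3 * x)     # true ATE = -0.1
w <- rbinom(n, 1, 0.2 + 0.6 * x)      # treatment depends on x only: unconfoundedness holds
y <- ifelse(w == 1, y1, y0)

# Naive difference in means: biased, because x is a confounder
tau_naive <- mean(y[w == 1]) - mean(y[w == 0])

# Difference in means within each stratum of x, averaged with weights P(X = x)
tau_x <- sapply(c(0, 1), function(v)
  mean(y[w == 1 & x == v]) - mean(y[w == 0 & x == v]))
tau_strat <- sum(tau_x * c(mean(x == 0), mean(x == 1)))

round(c(naive = tau_naive, stratified = tau_strat), 3)
```

With a single binary covariate, stratification is trivial; with many covariates, the propensity-score methods introduced below play the same role.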

The second assumption will be called here the overlap assumption. It is formally stated as follows.

\[\forall x \in \text{supp}(X), \qquad 0 < P(W = 1 \ | \ X = x) < 1\]

Effectively, this assumption guarantees that no subpopulation indexed by \(X = x\) is located entirely in either the control group or the treatment group. It is necessary to ensure that we can compare control and treated individuals within every subpopulation.

Finally, the conditional probability of treatment given the covariates, \(P(W=1|X=x)\), is called the propensity score. In the next sections, it will play a central role in the estimation and inference of causal effects. The propensity score can be estimated by any method you prefer, and it is always a good idea to check that the estimated propensity score satisfies the overlap assumption. For example, let’s estimate it using a logistic regression.

  1. Compute the propensity score by logistic regression of W on X and comment on the overlap assumption using an appropriate representation.
# Computing the propensity score by logistic regression of W on X.
fit_logistic <- glm(W ~ ., data = dplyr::select(df, -c("Y")), family = binomial(link = "logit"))
p_logistic <- as.vector(predict(fit_logistic, type = "response"))
# Note: type = "response" returns the probabilities p_i = exp(X beta) / (1 + exp(X beta));
# otherwise R returns the linear predictor logit(p_i)

hist(p_logistic, main = "Histogram: Logistic Regression Propensity Scores"
     , xlab = "Propensity Score", col = "cornflowerblue", freq = FALSE)
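Beyond the histogram, overlap can also be checked numerically: no estimated propensity score should sit at 0 or 1. A self-contained sketch on simulated, hypothetical data (the 0.01/0.99 cutoffs are an arbitrary illustrative choice):

```r
set.seed(2)
n <- 1000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))   # true propensity depends on x
fit <- glm(w ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

range(p_hat)                          # should be bounded away from 0 and 1
mean(p_hat < 0.01 | p_hat > 0.99)     # share of near-violations of overlap
```

A non-negligible share of scores near 0 or 1 would signal subpopulations that are (almost) never, or (almost) always, treated, for which no credible comparison is possible.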