knitr::opts_chunk$set(echo = TRUE, warning = FALSE, verbose = FALSE, message = FALSE)

# Clear any existing variables
rm(list = ls())

# Load all packages needed to execute the job
# If the packages are not installed, write
# install.packages("<name of package>")

library(ggplot2)    # plot
library(glmnet)     # lasso
library(grf)        # generalized random forests
library(sandwich)   # for robust CIs
library(tableone)   # compute standardized mean differences (SMDs)
library(cobalt)     # plot SMDs
library(FactoMineR) # visualization and PCA methods; see the FactoMineR website and YouTube channel
library(Matching)   # for IPW estimates with valid std errors
library(survey)     # for robust standard errors

Problem and data description

This dataset is simulated, and the variable names and relationships do not necessarily represent a realistic scenario. However, to make the results easier to comment on and interpret, we use plausible variable names. We will assume that our goal is to study the impact of a nutrition change on the development of Type 2 diabetes after a prediabetes diagnosis.

The source of the data informs us that the true effect in this population is -0.09. (Since these data are synthetically generated, it is indeed possible to have this information. Another possibility, in a real-world setting, would be to conduct a randomized controlled trial on the population of interest.)

In this tutorial, we will use the following variables.

  • Response variable

  • diabetes_5y: Indicator variable where \(= 1\) indicates development of type 2 diabetes by year 5 after treatment

  • Treatment variable

  • nutrition_control: Indicator variable where \(= 1\) indicates Nutrition change treatment

  • Covariates

  • sex: Indicator variable where \(= 1\) indicates male

  • diabetes_family: Indicator variable where \(= 1\) indicates presence of type 2 diabetes in the family

  • obese_family: Indicator variable where \(= 1\) indicates presence of obesity (BMI>30) in the family

  • alcohol_consumption: Indicator variable where \(= 1\) indicates regular consumption of alcohol

  • cardio_vascular_disease: Indicator variable where \(= 1\) indicates presence of cardiovascular disease

  • sedentary_work: Indicator variable where \(= 1\) indicates a sedentary work and lifestyle

  • age: Age at the start of treatment

  • number_md_5y: Number of referring physicians over 5 years

  • bmi: Body Mass Index at the start of treatment

  • mbp: Mean blood pressure at the start of treatment (average of systolic and diastolic blood pressure)

Below, we load the data as a .csv file, and rename the response and the treatment variable to \(Y\) and \(W\), respectively.

# Clear any existing variables
rm(list = ls())

# Set seed for reproducibility (the specific seed value is arbitrary)
set.seed(1)
# Load data
data_raw <- read.csv('diabetes_nutrition.csv')

# These are the covariates we'll use
variables_names <- c("number_md_5y", "bmi", "mbp", "age", "hospital")
binary_variables_names <- c("sex",
                            "diabetes_family", "obese_family", 
                            "alcohol_consumption", "cardio_vascular_disease", "sedentary_work")
covariates <- c(variables_names, binary_variables_names)
all_variables_names <- c(covariates, "nutrition_control", "diabetes_5y")

# Extracting indicator variables
binary_covariates <- data_raw[,binary_variables_names]

# Extracting outcome and treatment
outcome <- data_raw$diabetes_5y
treatment <- data_raw$nutrition_control

covariates <- data_raw[ ,variables_names]

# Setting up the data, renaming columns
df <- data.frame(covariates, binary_covariates, W=treatment, Y=outcome)

df$hospital <- as.factor(df$hospital)

Recall: Average Treatment Effect (ATE)

Let us briefly formalize our goal. We observe a sequence of triples \(\{(W_i, Y_i, X_i)\}_{i=1}^{N}\), where \(W_i\) represents whether subject \(i\) was “treated” with a nutrition plan, \(Y_i\) is a binary variable representing whether they developed diabetes within 5 years, and \(X_i\) is a vector of other observable characteristics. Moreover, in the Neyman-Rubin potential-outcomes framework, we will denote by \(Y_i(1)\) the random variable that represents the potential outcome of subject \(i\) had they received the treatment, and by \(Y_i(0)\) the same potential outcome had they not received it. The individual treatment effect for subject \(i\) can then be written as

\[Y_i(1) - Y_i(0)\]

Unfortunately, in our data we can only observe one of these two potential outcomes. Computing this difference for each individual is impossible. But we will try to use the information we have about the distribution of the data to say something about its average, called the average treatment effect (ATE) and denoted here by \(\tau\):

\[\tau := E[Y_i(1) - Y_i(0)]\]

Now, what method will work here depends on our assumptions about the data-generating process. In this tutorial, we will always assume for simplicity that the data is independently and identically distributed (iid).

In an experimental setting, we would also assume that the potential outcomes are independent of the treatment:

\[Y_i(1), Y_i(0) \ \perp \ W_i\]

In plain English, this would mean assuming that whether or not a subject was advised a Nutrition change has nothing to do with how they would respond to this “treatment”.

Since we have the information that the data is confounded, this assumption is violated. Indeed, it is likely that people who are more aware of the risk of type 2 diabetes, and therefore more open to changing their lifestyle, are more likely to receive the treatment.

The independence assumption above would allow us to produce a simple estimator for the ATE:

\[\begin{align} \tau &= E[\tau_{i}] \\ &= E[Y_i(1) - Y_i(0)] \\ &= E[Y_i(1)] - E[Y_i(0)] \qquad \qquad \because \text{Linearity of expectation}\\ &= E[Y_i(1)|W_i = 1] - E[Y_i(0)|W_i = 0] \qquad \because \text{Independence assumption} \\ &= E[Y_i|W_i = 1] - E[Y_i|W_i = 0] \qquad \because \text{Consistency} \\ \end{align}\]

In words, the math above states that to estimate the average treatment effect we would only need to know the diabetes prevalence among treated and among control subjects and compute their difference.

The implied estimator would be:

\[\hat{\tau} = \frac{1}{n_1}\sum_{i | W_i = 1} Y_i - \frac{1}{n_0}\sum_{i | W_i = 0} Y_{i}\]

where \(n_1\) and \(n_0\) are the numbers of subjects in the treatment and control groups, respectively. Under the above independence assumption, the following snippet would consistently estimate the average treatment effect and also give its confidence interval.

difference_in_means <- function(dataset) {
  treated_idx <- which(dataset$W == 1)
  control_idx <- which(dataset$W == 0)
  # Filter treatment / control observations, pull the outcome variable as a vector
  y1 <- dataset[treated_idx, "Y"] # Outcome in treatment group
  y0 <- dataset[control_idx, "Y"] # Outcome in control group
  n1 <- sum(dataset[, "W"])     # Number of obs in treatment
  n0 <- sum(1 - dataset[, "W"]) # Number of obs in control
  # The difference in means estimates the ATE
  tauhat <- mean(y1) - mean(y0)
  # 95% confidence interval
  se_hat <- sqrt(var(y1) / n1 + var(y0) / n0)
  lower_ci <- tauhat - 1.96 * se_hat
  upper_ci <- tauhat + 1.96 * se_hat
  return(c(ATE = tauhat, lower_ci = lower_ci, upper_ci = upper_ci))
}

tauhat_naive <- difference_in_means(df)

Look at the output:

##         ATE    lower_ci    upper_ci 
## -0.00597090 -0.02719679  0.01525499

Assessing treatment bias in the data

We have the information that the true ATE is -0.09, but from the naive estimation we would conclude that the treatment has no effect. This is due to the treatment bias in the data, i.e., the violation of the independence assumption recalled above.
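To see concretely how confounding hides a real effect, here is a small simulation on hypothetical data (all variable names and parameter values below are invented for illustration, with a true ATE of -0.1): the same difference in means is computed once under random assignment and once under confounded assignment, where a risk factor drives both treatment uptake and the outcome.

```r
set.seed(123)
n <- 100000
x <- rbinom(n, 1, 0.5)                # hypothetical confounder (e.g. risk awareness)
y0 <- rbinom(n, 1, 0.4 + 0.3 * x)     # potential outcome without treatment
y1 <- rbinom(n, 1, 0.3 + 0.3 * x)     # potential outcome with treatment: true ATE = -0.1

# Randomized assignment: W independent of (Y(0), Y(1))
w_rand <- rbinom(n, 1, 0.5)
y_rand <- ifelse(w_rand == 1, y1, y0)
tau_rand <- mean(y_rand[w_rand == 1]) - mean(y_rand[w_rand == 0])

# Confounded assignment: at-risk subjects (x = 1) are treated more often
w_conf <- rbinom(n, 1, 0.2 + 0.6 * x)
y_conf <- ifelse(w_conf == 1, y1, y0)
tau_conf <- mean(y_conf[w_conf == 1]) - mean(y_conf[w_conf == 0])

round(c(randomized = tau_rand, confounded = tau_conf), 3)
```

Under randomization, the difference in means lands near the true -0.1; under confounding, it is pulled toward zero and beyond, mirroring what we observe in the tutorial data.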

A look at the contingency table of treatment and outcome can give a first hint.

  1. Using the functions table and prop.table (with the option margin if needed), comment on the difference between the old and new distributions. Estimate, in the new data, P(Treatment), P(Diabetes | Treatment) and P(Diabetes | Control), and comment.
res.table <- table(df$W == 1, df$Y == 1,
                   dnn = c("Treated", "Diabetes"))
prop.table(res.table)
##        Diabetes
## Treated      FALSE       TRUE
##   FALSE 0.23771252 0.50056672
##   TRUE  0.08583205 0.17588872

You can also use the option margin in prop.table.

prop.table(res.table, margin=1)
##        Diabetes
## Treated     FALSE      TRUE
##   FALSE 0.3219819 0.6780181
##   TRUE  0.3279528 0.6720472

Here we retrieve the naive treatment effect \(\bar Y_1 - \bar Y_0 = 0.6720472 - 0.6780181 \approx -0.006\). When interpreting a contingency table, if the percentages sum to 1 within rows, you compare the rows (reading down each column), and vice versa.

From the above table we can also deduce the following estimates for the population characteristics:

Diabetes prevalence: 67.6 %

Estimated P(Treatment): 0.26

P(Diabetes | Treatment): 0.67

P(Diabetes | Control): 0.68

These quantities suggest that the treatment has no impact, or even a harmful one, on the chances of developing type 2 diabetes.

We know, however, from the person who generated the data that this is not the case and that the treatment has a beneficial impact.
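The estimates above can be read straight off prop.table; here is a toy sketch (hypothetical vectors, not the tutorial data) showing which call yields which probability:

```r
w <- c(0, 0, 0, 1, 1, 0, 1, 0)   # hypothetical treatment indicator
y <- c(1, 0, 1, 1, 0, 1, 0, 0)   # hypothetical outcome

tab <- table(Treated = w == 1, Diabetes = y == 1)

prop.table(tab)              # joint probabilities P(W, Y); all cells sum to 1
rowSums(prop.table(tab))     # marginals: P(Control) and P(Treatment)
prop.table(tab, margin = 1)  # rows sum to 1: P(Diabetes | Control), P(Diabetes | Treatment)
```

The margin = 1 table conditions on the rows (treatment status), which is exactly what P(Diabetes | Treatment) and P(Diabetes | Control) require.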

A way to assess the imbalance between the treatment and control groups is to compute and visualize the standardized mean differences (SMDs) of the covariates between the treated and the controls.

In the medical community, this is often done using the packages tableone and cobalt:

xvars <- colnames(df[, sapply(df, is.numeric)])
xvars <- xvars[!xvars %in% c("W", "Y")]
tab <- CreateTableOne(vars = xvars, strata = "W", data = df[, sapply(df, is.numeric)], test = FALSE)
print(tab, smd = TRUE)
##                                      Stratified by W
##                                       0             1             SMD   
##   n                                    7165          2540               
##   number_md_5y (mean (SD))             1.88 (0.72)   1.98 (0.75)   0.132
##   bmi (mean (SD))                     31.92 (3.58)  31.97 (3.60)   0.014
##   mbp (mean (SD))                     35.77 (10.98) 36.52 (11.28)  0.068
##   age (mean (SD))                     47.76 (13.70) 46.26 (15.04)  0.104
##   sex (mean (SD))                      0.42 (0.49)   0.76 (0.43)   0.746
##   diabetes_family (mean (SD))          0.88 (0.33)   0.84 (0.37)   0.108
##   obese_family (mean (SD))             0.86 (0.34)   0.75 (0.43)   0.289
##   alcohol_consumption (mean (SD))      0.27 (0.44)   0.25 (0.43)   0.043
##   cardio_vascular_disease (mean (SD))  0.41 (0.49)   0.40 (0.49)   0.015
##   sedentary_work (mean (SD))           0.52 (0.50)   0.13 (0.34)   0.897
balance <-[, xvars], treat = df$W, estimand = "ATE", continuous = "std")
love.plot(x = balance, stat = "mean.diffs", abs = TRUE, var.order = "unadjusted", threshold = 0.1, cex=0.8)

This plot displays the standardized mean differences (here, before any adjustment or weighting), i.e., the difference in means between the two groups divided by the pooled standard deviation. For a variable \(X_j\), this corresponds to:

\[SMD(X_j) = \frac{\overline{X_j^{(1)}} - \overline{X_j^{(0)}}}{\sqrt{\frac{(s^{(1)})^{2}+(s^{(0)})^{2}}{2}}}\] This SMD can be interpreted as a difference in means in standard deviation units.

Note that for binary variables, the SMD is defined via sample probabilities: if variable \(X_k\) is binary (assume \(X_k\in\{0,1\}\)), denote by \(p_k^{(w)}\) the probability \(Pr(X_{ik} = 1)\) in treatment group \(w\). Then \[SMD(X_k) = \frac{\widehat{p_k}^{(1)} - \widehat{p_k}^{(0)}}{\sqrt{\frac{\widehat{p_k}^{(1)}(1-\widehat{p_k}^{(1)})+\widehat{p_k}^{(0)}(1-\widehat{p_k}^{(0)})}{2}}} \]
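The two formulas can be coded directly; the sketch below (simulated, hypothetical data; the helper names smd and smd_binary are our own) can serve as a sanity check against what CreateTableOne reports.

```r
# Direct implementation of the SMD formulas above
smd <- function(x, w) {
  m1 <- mean(x[w == 1]); m0 <- mean(x[w == 0])
  s1 <- var(x[w == 1]);  s0 <- var(x[w == 0])
  (m1 - m0) / sqrt((s1 + s0) / 2)   # pooled-SD standardization
}
smd_binary <- function(x, w) {
  p1 <- mean(x[w == 1]); p0 <- mean(x[w == 0])
  (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
}

set.seed(1)
w <- rbinom(500, 1, 0.5)
x_cont <- rnorm(500, mean = 30 + 2 * w, sd = 4)  # mean shifted by 0.5 SD in the treated
x_bin  <- rbinom(500, 1, 0.3 + 0.2 * w)          # prevalence differs between groups

round(c(continuous = smd(x_cont, w), binary = smd_binary(x_bin, w)), 3)
```

For the continuous covariate, the population SMD is 2/4 = 0.5 by construction, and the sample estimate lands nearby.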

Standardized mean differences (SMDs) larger than some threshold th, for instance th = 0.1, indicate that the covariate distributions in the two groups differ: the treatment is not assigned uniformly at random. This motivates some adjustment or balancing before performing a causal analysis of the treatment effect on the outcome.

Unsurprisingly, the two groups differ on the set of variables that we used earlier to identify likely and unlikely diabetes patients.

Similarly, you can use the function catdes from the FactoMineR package, which is dedicated to describing a categorical variable by the other variables. This function is often used in clustering to quickly describe clusters.

df_Q <- df
df_Q[, "W"] <- as.factor(df_Q[, "W"])
indW <- which(colnames(df_Q) == "W")
res.catdes <- catdes(df_Q, num.var = indW, proba = 0.05)
plot(res.catdes, level = 0.05)

For more information on catdes, you can look at the supplementary files. In addition, you can use the FactoShiny package for an interactive interface.


To be able to estimate the ATE from observational data, we need additional assumptions.

The first one is known as unconfoundedness, and it is formalized as a conditional independence assumption as follows.

\[Y_i(1), Y_i(0) \perp W_i \ | \ X_i\]

Unconfoundedness implies that the treatment is randomly assigned within each subpopulation indexed by \(X_i = x\). Alternatively, it means that once we know all observable characteristics of individual \(i\), knowing their treatment status gives us no extra information about their potential outcomes.
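Unconfoundedness is exactly what makes adjustment possible: within each stratum of \(X\), treatment is as good as randomized, so a stratified difference in means recovers the ATE even when the naive one fails. A sketch on simulated, hypothetical data (parameter values invented; true ATE set to -0.1):

```r
set.seed(42)
n <- 100000
x <- rbinom(n, 1, 0.5)                # observed confounder
y0 <- rbinom(n, 1, 0.4 + 0.3 * x)
y1 <- rbinom(n, 1, 0.3 + 0.3 * x)     # true ATE = -0.1
w <- rbinom(n, 1, 0.2 + 0.6 * x)      # treatment depends on x only: unconfoundedness holds
y <- ifelse(w == 1, y1, y0)

# Naive difference in means: biased, because x is a confounder
tau_naive <- mean(y[w == 1]) - mean(y[w == 0])

# Difference in means within each stratum of x, averaged with weights P(X = x)
tau_x <- sapply(c(0, 1), function(v)
  mean(y[w == 1 & x == v]) - mean(y[w == 0 & x == v]))
tau_strat <- sum(tau_x * c(mean(x == 0), mean(x == 1)))

round(c(naive = tau_naive, stratified = tau_strat), 3)
```

With a single binary covariate, stratification is trivial; with many covariates, the propensity-score methods introduced below play the same role.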

The second assumption will be called here the overlap assumption. It is formally stated as follows.

\[\forall x \in \text{supp}(X), \qquad 0 < P(W = 1 \ | \ X = x) < 1\]

Effectively, this assumption guarantees that no subpopulation indexed by \(X = x\) is located entirely in either the control group or the treatment group. It is necessary to ensure that we can compare control and treated individuals within every subpopulation.

Finally, the conditional probability of treatment given the covariates, \(P(W=1|X=x)\), is called the propensity score. In the next sections, it will play a central role in the estimation and inference of causal effects. The propensity score can be estimated by any method you prefer, and it is always a good idea to check that the estimated propensity score satisfies the overlap assumption. For example, let’s estimate it using a logistic regression.

  1. Compute the propensity score by logistic regression of W on X and comment on the overlap assumption using an appropriate representation.
# Computing the propensity score by logistic regression of W on X.
fit_logistic <- glm(W ~ ., data = dplyr::select(df, -c("Y")), family = binomial(link = "logit"))
p_logistic <- as.vector(predict(fit_logistic, type = "response"))
# Note: type = "response" returns the probabilities p_i = exp(X beta) / (1 + exp(X beta));
# otherwise R returns the linear predictor logit(p_i)

hist(p_logistic, main = "Histogram: Logistic Regression Propensity Scores"
     , xlab = "Propensity Score", col = "cornflowerblue", freq = FALSE)
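Beyond the histogram, overlap can also be checked numerically: no estimated propensity score should sit at 0 or 1. A self-contained sketch on simulated, hypothetical data (the 0.01/0.99 cutoffs are an arbitrary illustrative choice):

```r
set.seed(2)
n <- 1000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))   # true propensity depends on x
fit <- glm(w ~ x, family = binomial)
p_hat <- predict(fit, type = "response")

range(p_hat)                          # should be bounded away from 0 and 1
mean(p_hat < 0.01 | p_hat > 0.99)     # share of near-violations of overlap
```

A non-negligible share of scores near 0 or 1 would signal subpopulations that are (almost) never, or (almost) always, treated, for which no credible comparison is possible.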