Abstract

In this lab, you will learn how to apply several methods for estimating causal effects from observational data. To know when our methods give correct answers, we will use a synthetic dataset in which the average treatment effect is known, since we control the data-generating process. On this dataset, we will use several techniques to retrieve the correct answer, and see which methods are inconsistent on observational data. By the end of the lab, you will have experimented with a new set of tools for causal inference problems, and will have an idea of which methods give more reliable answers.

```
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, verbose = FALSE, message = FALSE)
# Clear any existing variables
rm(list = ls())
# Load all packages needed to execute the job
# If the packages are not installed, write
# install.packages("<name of package>")
library(ggplot2) # plot
library(glmnet) # lasso
library(grf) # generalized random forests
library(sandwich) # for robust CIs
library(tableone) # compute standardized mean differences (SMDs)
library(cobalt) # plot SMDs
library(FactoMineR) # visualization and PCA methods, FactoMineR website http://factominer.free.fr/ and YouTube channel https://www.youtube.com/user/HussonFrancois
library(ranger)
library(Matching) # for IPW estimates with valid std errors
library(survey) # for robust standard errors
```

This dataset is simulated, and the variable names and relationships do not necessarily represent a realistic scenario. However, to facilitate commenting on and interpreting the results, we use plausible variable names. We will assume that our goal is to study the **impact of nutrition change on the development of Type 2 diabetes after a prediabetes diagnosis**.

The source of the data informs us that the true effect in this population is -0.09. (Since these data are synthetically generated, it is indeed possible to have this information; in a real-world setting, another possibility would be to conduct a randomized controlled trial on the population of interest.)

In this tutorial, we will use the following variables.

Response variable

- *diabetes_5y*: indicator variable where \(= 1\) indicates development of type 2 diabetes by year 5 after treatment

Treatment variable

- *nutrition_control*: indicator variable where \(= 1\) indicates the *nutrition change* treatment

Covariates

- *sex*: indicator variable where \(= 1\) indicates male
- *diabetes_family*: indicator variable where \(= 1\) indicates presence of type 2 diabetes in the family
- *obese_family*: indicator variable where \(= 1\) indicates presence of obesity (BMI > 30) in the family
- *alcohol_consumption*: indicator variable where \(= 1\) indicates regular consumption of alcohol
- *cardio_vascular_disease*: indicator variable where \(= 1\) indicates presence of cardiovascular disease
- *sedentary_work*: indicator variable where \(= 1\) indicates sedentary work and lifestyle
- *age*: age at the start of treatment
- *number_md_5y*: number of referring physicians over 5 years
- *bmi*: body mass index at the start of treatment
- *mbp*: mean blood pressure at the start of treatment (average of systolic and diastolic blood pressure)

Below, we load the data as a .csv file, and rename the response and the treatment variable to \(Y\) and \(W\), respectively.

```
# Clear any existing variables
rm(list = ls())
# Set seed for reproducibility
set.seed(20191111)
# Load data
data_raw <- read.csv('diabetes_nutrition.csv')
# These are the covariates we'll use
variables_names <- c("number_md_5y", "bmi", "mbp", "age", "hospital")
binary_variables_names <- c("sex",
                            "diabetes_family", "obese_family",
                            "alcohol_consumption", "cardio_vascular_disease", "sedentary_work")
covariates <- c(variables_names, binary_variables_names)
all_variables_names <- c(covariates, "nutrition_control", "diabetes_5y")
# Extracting indicator variables
binary_covariates <- data_raw[,binary_variables_names]
# Extracting outcome and treatment
outcome <- data_raw$diabetes_5y
treatment <- data_raw$nutrition_control
covariates <- data_raw[ ,variables_names]
# Setting up the data, renaming columns
df <- data.frame(covariates, binary_covariates, W=treatment, Y=outcome)
df$hospital <- as.factor(df$hospital)
```

Let us briefly formalize our goal. We observe a sequence of triples \(\{(W_i, Y_i, X_i)\}_{i=1}^{N}\), where \(W_i\) indicates whether subject \(i\) was “treated” with a *Nutrition* plan, \(Y_i\) is a binary variable representing whether they developed diabetes within 5 years, and \(X_i\) is a vector of other observable characteristics. Moreover, in the Neyman-Rubin potential-outcomes framework, we will denote by \(Y_i(1)\) the *random variable* that represents the **potential outcome** of subject \(i\) had they received the treatment, and by \(Y_i(0)\) the same potential outcome had they not received it. The **individual treatment effect** for subject \(i\) can then be written as

\[Y_i(1) - Y_i(0)\]

Unfortunately, in our data we can only observe one of these two potential outcomes, so computing this difference for each individual is impossible. But we can use the information we have about the distribution of the data to say something about its average, called the **average treatment effect (ATE)** and denoted here by \(\tau\):

\[\tau := E[Y_i(1) - Y_i(0)]\]

Now, which method will work here depends on our assumptions about the data-generating process. In this tutorial, we will always assume for simplicity that the data are independently and identically distributed (*iid*).

In an experimental setting, we would also assume that the potential outcomes are independent of the treatment:

\[Y_i(1), Y_i(0) \ \perp \ W_i\]

In plain English, this would mean assuming that whether or not a subject was advised a *Nutrition* change has nothing to do with how they would respond to this “treatment”.

Since we have the information that the data is confounded, this assumption is violated. Indeed, it is likely that people who are more aware of the risk of type 2 diabetes and therefore open to change their lifestyle are more likely to receive the treatment.

The independence assumption above would allow us to produce a simple estimator for the ATE:

\[\begin{align} \tau &= E[\tau_{i}] \\ &= E[Y_i(1) - Y_i(0)] \\ &= E[Y_i(1)] - E[Y_i(0)] \qquad \because \text{Linearity of expectations}\\ &= E[Y_i(1)|W_i = 1] - E[Y_i(0)|W_i = 0] \qquad \because \text{Independence assumption} \\ &= E[Y_i|W_i = 1] - E[Y_i|W_i = 0] \qquad \because \text{Consistency} \\ \end{align}\]

In words, the math above states that to estimate the average treatment effect we would just need to know the diabetes prevalence among treated and control subjects and compute their difference.

The implied estimator would be:

\[\hat{\tau} = \frac{1}{n_1}\sum_{i | W_i = 1} Y_i - \frac{1}{n_0}\sum_{i | W_i = 0} Y_{i}\]

where \(n_1\) and \(n_0\) are the numbers of subjects in the treatment and control groups, respectively. Under the above independence assumption, the following snippet would consistently estimate the average treatment effect and also give its confidence interval.

```
difference_in_means <- function(dataset) {
  treated_idx <- which(dataset$W == 1)
  control_idx <- which(dataset$W == 0)
  # Filter treatment / control observations, pull the outcome variable as a vector
  y1 <- dataset[treated_idx, "Y"]  # Outcome in treatment group
  y0 <- dataset[control_idx, "Y"]  # Outcome in control group
  n1 <- sum(dataset[, "W"])        # Number of obs in treatment
  n0 <- sum(1 - dataset[, "W"])    # Number of obs in control
  # Difference in means is the ATE estimate
  tauhat <- mean(y1) - mean(y0)
  # 95% confidence interval
  se_hat <- sqrt(var(y0) / n0 + var(y1) / n1)
  lower_ci <- tauhat - 1.96 * se_hat
  upper_ci <- tauhat + 1.96 * se_hat
  return(c(ATE = tauhat, lower_ci = lower_ci, upper_ci = upper_ci))
}
tauhat_naive <- difference_in_means(df)
```

Look at the output:

`print(tauhat_naive)`

```
## ATE lower_ci upper_ci
## -0.00597090 -0.02719679 0.01525499
```

We have the information that the true ATE is -0.09, but from the naive estimation we would conclude that the treatment has no effect. This is due to the confounding bias in the data, i.e., the violation of the independence assumption recalled above.
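This kind of bias is easy to reproduce on a toy simulation (a hypothetical data-generating process, unrelated to the lab's dataset): a confounder that raises both the treatment probability and the outcome makes the difference in means clearly nonzero even though the true treatment effect is exactly zero.

```r
# Toy confounded DGP: the true treatment effect is exactly 0
set.seed(42)
n <- 100000
u <- rbinom(n, 1, 0.5)            # unobserved confounder
w <- rbinom(n, 1, 0.2 + 0.6 * u)  # u raises the chance of treatment
y <- rbinom(n, 1, 0.3 + 0.4 * u)  # u raises the outcome; w has no effect
naive <- mean(y[w == 1]) - mean(y[w == 0])
naive  # clearly positive despite a null treatment effect
```

Here the naive difference in means concentrates around 0.24 (the product of the confounder's effects on treatment and outcome), illustrating how confounding alone can create an apparent effect.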

A look at the contingency table of treatment and outcome can give a first hint.

- Using the functions `table` and `prop.table` (with the option `margin` if needed), comment on the differences between the treated and control distributions. Estimate P(Treatment), P(Diabetes | Treatment) and P(Diabetes | Control), and comment.

```
res.table <- table(df$W == 1, df$Y == 1,
                   dnn = c("Treated", "Diabetes"))
prop.table(res.table)
```

```
## Diabetes
## Treated FALSE TRUE
## FALSE 0.23771252 0.50056672
## TRUE 0.08583205 0.17588872
```

You can also use the option `margin` in `prop.table`.

`prop.table(res.table, margin=1)`

```
## Diabetes
## Treated FALSE TRUE
## FALSE 0.3219819 0.6780181
## TRUE 0.3279528 0.6720472
```

Here we retrieve the naive treatment effect \(\bar Y_1-\bar Y_0 = 0.6720472 - 0.6780181\). When interpreting a contingency table, if the percentages sum to one within rows, comparisons should be made down the columns, and vice versa.

From the above tables we can also deduce the following estimates for the population characteristics:

- Diabetes prevalence: 67.6 %
- Estimated P(Treatment): 0.26
- P(Diabetes | Treatment): 0.67
- P(Diabetes | Control): 0.68

These quantities suggest that the treatment has no impact, or even a harmful one, on the chances of developing type 2 diabetes.

We know, however, from the person who generated the data that this is not the case and that the treatment has a beneficial impact.
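These quantities can be recomputed directly from the joint proportions printed above. The sketch below hardcodes the four cells of the `prop.table` output, so it is a consistency check on those numbers rather than a new computation on the data:

```r
# Joint proportions copied from the prop.table output above
joint <- matrix(c(0.23771252, 0.50056672,   # control row: no diabetes, diabetes
                  0.08583205, 0.17588872),  # treated row: no diabetes, diabetes
                nrow = 2, byrow = TRUE,
                dimnames = list(Treated = c("FALSE", "TRUE"),
                                Diabetes = c("FALSE", "TRUE")))
p_treat <- sum(joint["TRUE", ])                                   # P(Treatment)
p_diabetes <- sum(joint[, "TRUE"])                                # diabetes prevalence
p_diab_treated <- joint["TRUE", "TRUE"] / p_treat                 # P(Diabetes | Treatment)
p_diab_control <- joint["FALSE", "TRUE"] / sum(joint["FALSE", ])  # P(Diabetes | Control)
round(c(p_treat, p_diabetes, p_diab_treated, p_diab_control), 3)
```

The last line reproduces the figures above: approximately 0.262, 0.676, 0.672 and 0.678.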

A way to assess the imbalance between the treatment and control groups is to compute and visualize the standardized mean differences (SMDs) of the covariates between the treated and the controls.

In the medical community this is often done using the packages `tableone` and `cobalt`:

```
xvars <- colnames(df[, sapply(df, is.numeric)])
xvars <- xvars[!xvars %in% c("W", "Y")]
tab <- CreateTableOne(vars = xvars, strata = "W",
                      data = df[, sapply(df, is.numeric)], test = FALSE)
print(tab, smd = TRUE)
```

```
## Stratified by W
## 0 1 SMD
## n 7165 2540
## number_md_5y (mean (SD)) 1.88 (0.72) 1.98 (0.75) 0.132
## bmi (mean (SD)) 31.92 (3.58) 31.97 (3.60) 0.014
## mbp (mean (SD)) 35.77 (10.98) 36.52 (11.28) 0.068
## age (mean (SD)) 47.76 (13.70) 46.26 (15.04) 0.104
## sex (mean (SD)) 0.42 (0.49) 0.76 (0.43) 0.746
## diabetes_family (mean (SD)) 0.88 (0.33) 0.84 (0.37) 0.108
## obese_family (mean (SD)) 0.86 (0.34) 0.75 (0.43) 0.289
## alcohol_consumption (mean (SD)) 0.27 (0.44) 0.25 (0.43) 0.043
## cardio_vascular_disease (mean (SD)) 0.41 (0.49) 0.40 (0.49) 0.015
## sedentary_work (mean (SD)) 0.52 (0.50) 0.13 (0.34) 0.897
```

```
balance <- bal.tab(df[,xvars], treat = df$W, estimand="ATE", continuous="std")
love.plot(x = balance, stat = "mean.diffs", abs = TRUE, var.order = "unadjusted", threshold = 0.1, cex=0.8)
```

This plot gives the standardized mean differences, i.e., the differences in means between the two groups divided by the pooled standard deviation (here only the unadjusted differences are shown; once weights are supplied, `love.plot` also displays them after weighting). For a variable \(X_j\), this corresponds to:

\[SMD(X_j) = \frac{\overline{X_j^{(1)}} - \overline{X_j^{(0)}}}{\sqrt{\frac{(s^{(1)})^{2}+(s^{(0)})^{2}}{2}}}\] This SMD can be interpreted as a difference in means in standard deviation units.

Note that for binary variables, the SMD is defined via sample proportions: if variable \(X_k\) is binary (assume \(X_k\in\{0,1\}\)), denote by \(p_k^{(w)}\) the probability \(Pr(X_{ik} = 1)\) in treatment group \(w\), and by \(\widehat{p_k}^{(w)}\) its sample estimate. Then \[SMD(X_k) = \frac{\widehat{p_k}^{(1)} - \widehat{p_k}^{(0)}}{\sqrt{\frac{\widehat{p_k}^{(1)}(1-\widehat{p_k}^{(1)})+\widehat{p_k}^{(0)}(1-\widehat{p_k}^{(0)})}{2}}} \]

Standardized mean differences (SMDs) higher than some threshold, for instance 0.1, indicate that the covariate distributions in the two groups differ: the treatment is not assigned uniformly at random. This explains the need for some adjustment or balancing in order to perform a causal analysis of the treatment effect on a given outcome.
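To make the two formulas concrete, here is a hand computation of the SMD on small toy samples (`x1`, `x0`, `b1`, `b0` are hypothetical data, not part of the lab's dataset):

```r
# Toy treated / control samples for a quantitative covariate
x1 <- c(2, 3, 4, 5)  # hypothetical treated group
x0 <- c(1, 2, 3, 4)  # hypothetical control group
smd_quant <- (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)

# Toy binary covariate: SMD via sample proportions
b1 <- c(1, 1, 1, 0)  # sample proportion 0.75 in treated
b0 <- c(1, 0, 0, 0)  # sample proportion 0.25 in control
p1 <- mean(b1); p0 <- mean(b0)
smd_bin <- (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
c(smd_quant = smd_quant, smd_bin = smd_bin)
```

Both toy SMDs are well above the 0.1 threshold, so in a real analysis these covariates would be flagged as imbalanced.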

Unsurprisingly, the two groups differ on the set of variables that we used earlier to identify likely and unlikely diabetes patients.

Similarly, you can use the function `catdes` of the `FactoMineR` package, which is dedicated to describing a categorical variable by other variables. This function is often used in clustering to quickly describe clusters.

```
df_Q <- df
df_Q[, "W"] <- as.factor(df_Q[, "W"])
indW <- which(colnames(df_Q)=="W")
res.cat <- catdes(df_Q, indW, proba = 0.05)
plot(res.cat, level = 0.05)
```

For more information on `catdes`, you can look at the supplementary files. In addition, you can use `FactoShiny`: https://www.youtube.com/watch?v=JAsvTf-8cXo&list=PLnZgp6epRBbQ3UELPgndosV4dNYpT9l1-&index=4&t=0s

To be able to estimate ATE from observational data, we need additional assumptions.

The first one is known as **unconfoundedness**, and it is formalized as a conditional independence assumption as follows.

\[Y_i(1), Y_i(0) \perp W_i \ | \ X_i\]

Unconfoundedness implies that the treatment is randomly assigned within each subpopulation indexed by \(X_i = x\). Alternatively, it means that once we know all observable characteristics of individual \(i\), knowing their treatment status gives us no extra information about their potential outcomes.

The second assumption will be called here the **overlap assumption**. It is formally stated as follows.

\[\forall x \in \text{supp}(X), \qquad 0 < P(W = 1 \ | \ X = x) < 1\]

Effectively, this assumption guarantees that no subpopulation indexed by \(X = x\) is entirely contained in either the control or the treatment group. It is necessary to ensure that we can compare control and treated individuals within every subpopulation.

Finally, the conditional probability of treatment given the covariates, \(P(W=1|X=x)\), is called the **propensity score**. In the next sections, it will play a central role in the estimation and inference of causal effects. The propensity score can be estimated by any method you prefer, and it is always a good idea to check that the estimated propensity score satisfies the overlap assumption. For example, let’s estimate it using a logistic regression.

- Compute the propensity score by a logistic regression of W on X and comment on the overlap assumption using an appropriate representation.

```
# Computing the propensity score by logistic regression of W on X.
p_logistic.fit <- glm(W ~ ., data = dplyr::select(df, -c("Y")), family = binomial(link = "logit"))
p_logistic <- as.vector(predict(p_logistic.fit, type = "response"))
# Note: type = "response" returns the probabilities p_i = exp(X beta) / (1 + exp(X beta));
# otherwise, R outputs the linear predictor logit(p_i)
hist(p_logistic, main = "Histogram: Logistic Regression Propensity Scores",
     xlab = "Propensity Score", col = "cornflowerblue", freq = FALSE)
```
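Beyond the histogram, a complementary numeric check of overlap is to look at the range of the estimated scores. The sketch below does this on simulated toy data (a self-contained, hypothetical example; in the lab you would apply the same check to `p_logistic` above):

```r
# Simulated toy data where the treatment probability depends on one covariate
set.seed(1)
n <- 1000
x <- rnorm(n)
w <- rbinom(n, 1, plogis(0.5 * x))  # true propensity: plogis(0.5 * x)
fit <- glm(w ~ x, family = binomial)
p_hat <- predict(fit, type = "response")
# Overlap holds in-sample if the estimated scores stay strictly inside (0, 1),
# away from the boundaries
range(p_hat)
```

Estimated scores piling up near 0 or 1 would signal subpopulations that are (almost) always in the same group, i.e., a likely violation of the overlap assumption.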