Handling missing values

Questions

  1. When you suggest methods to deal with missing values, a recurrent question from users is: “What percentage of missing values is acceptable in my data set? Is 50% too much, but 20% OK?” What is your answer to this question?

  2. Explain the aims of multiple imputation in comparison to single imputation.

  3. Your aim is to impute a data set (predict as well as possible the missing values). You have 3 imputation methods available. How could you compare them?

Continuous data with missing values - Regression with missing data via Multiple Imputation

First of all you need to install the following packages:

install.packages("VIM")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")

We work again on the ozone data set, but this time an incomplete version of it. A quick reminder of the context: air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have shown the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2) and ozone (O3), can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.

We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are

  • maxO3 (maximum daily ozone)
  • maxO3v (maximum daily ozone the previous day)
  • T9, T12, T15 (temperature at 9am, midday and 3pm)
  • Vx9, Vx12, Vx15 (projection of the wind speed vector on the east-west axis at 9am, midday and 3pm)
  • Ne9, Ne12, Ne15 (nebulosity, i.e. cloud cover, at 9am, midday and 3pm)

Here the final aim is to analyse the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. To do so we will perform a regression to explain maxO3 as a function of all the other variables. This data set is incomplete (there are missing values): it happens frequently that a machine fails for a day, so that some information is not recorded. We will therefore perform regression via multiple imputation.

Import the data.

ozo <- read.table("ozoneNA.csv", header = TRUE,
                  sep = ",", row.names = 1)
WindDirection <- ozo[,12]
don <- ozo[,1:11]   #### keep the continuous variables
summary(don)
head(don)
dim(don)
  1. When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?

First, we perform some descriptive statistics (how many values are missing? how many variables and individuals have missing entries?) and try to inspect and visualize the pattern of missing entries, to get hints on the mechanism that generated the missingness. For this purpose, we use the R package VIM (Visualization and Imputation of Missing Values, by Matthias Templ) as well as Multiple Correspondence Analysis (FactoMineR package). The package VIM provides tools for the visualization of missing or imputed values, which can be used for exploring the data and the structure of the missing or imputed values.

The VIM function aggr calculates and plots the amount of missing entries in each variable and in certain combinations of variables (those that tend to be missing simultaneously).

library(VIM)
res <- summary(aggr(don, sortVar = TRUE))$combinations
head(res[rev(order(res[, 2])), ])
aggr(don, sortVar = TRUE)

The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data are coded according to a continuous color scheme (gray scale), while missing/imputed data are visualized by a clearly distinguishable color (red). If you use RStudio the plot is not interactive (hence the warnings), but if you use R directly, you can click on a column of your choice: this will sort the rows in decreasing order of the values of this column.

matrixplot(don, sortby = 2)

The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x,y), the points for which both values are observed are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes, with and without missing values (in red, the boxplot of x computed on the observations where y is missing; in blue, on the observations where y is observed).

marginplot(don[,c("T9","maxO3")])
  1. Explain why these plots could be useful and describe what you learn from them.

Another plot that can be useful before starting to deal with missing values is based on MCA. We create a categorical data set with “o” when the value of the cell is observed and “m” when it is missing, with the same row and column names as in the original data. Then, we perform Multiple Correspondence Analysis with the MCA function to visualize the associations between missingness patterns.

library(FactoMineR)
?MCA
data_miss <- data.frame(is.na(don))
# code each cell as "m" (missing) or "o" (observed)
data_miss <- as.data.frame(apply(X = data_miss, FUN = function(x) if (x) "m" else "o", MARGIN = c(1, 2)))
res.mca <- MCA(data_miss, graph = FALSE)
plot(res.mca, invis = "ind", title = "MCA graph of the categories")

Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. We use the R package missMDA, dedicated to performing principal component methods with missing values and to imputing data with PC methods.

Determine the number of components ncp to keep using the estim_ncpPCA function. Perform PCA with missing values using the imputePCA function and ncp components. Then plot the correlation circle.

?estim_ncpPCA
?imputePCA
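
One possible way to proceed is sketched below; the choice of Kfold cross-validation, the range of components explored, and the use of FactoMineR's PCA on the completed data set are assumptions, one option among others.

library(missMDA)
library(FactoMineR)
# estimate the number of components by cross-validation
nb <- estim_ncpPCA(don, ncp.max = 5, method.cv = "Kfold")
nb$ncp
# perform PCA with missing values: impute with the selected number of components
res.comp <- imputePCA(don, ncp = nb$ncp)
# PCA on the completed data set, then the correlation circle
res.pca <- PCA(res.comp$completeObs, graph = FALSE)
plot(res.pca, choix = "var")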
  1. Briefly explain how this analysis is performed.

  2. Could you guess how cross-validation is performed to select the number of components?

Then, to run the regression with missing values, we use multiple imputation. We impute the data assuming either 1) a Gaussian distribution (library Amelia) or 2) a PCA-based model (library missMDA). Note that there are two families of approaches: Joint Modeling (one joint probabilistic model for all the variables together) and Conditional Modeling (one model per variable). We refer to the references given in the slides for more details. We use the R package Amelia. We generate 100 imputed data sets with the amelia method:

library(Amelia)
?amelia
res.amelia <- amelia(don, m = 100)
# names(res.amelia$imputations)
res.amelia$imputations$imp1   # the first imputed data set

Now generate 100 imputed data sets with the MIPCA method and 2 components. Store the result in a variable called res.MIPCA.

?MIPCA
?plot.MIPCA
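
A minimal sketch; we assume here that the number of imputed data sets is controlled by the nboot argument of MIPCA.

library(missMDA)
res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100)   # 100 imputed data sets with 2 components
names(res.MIPCA)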

Exploratory analysis is very important, even at this stage of the analysis.

  1. Suggest ways to inspect/visualize the imputed values. Indeed, you should check that your imputed values seem to make sense.

  2. Check the compare.density function and apply it to compare the distributions of the T12 variable. Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?

?compare.density
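
For instance, a sketch on the Amelia output:

compare.density(res.amelia, var = "T12")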

The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted in turn and, for each one, 100 values are predicted (using the same MI method); the mean and 90% confidence interval of these 100 values are then computed. We inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations would fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y=x line, meaning that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (e.g., blue = 0-2 missing entries).

?overimpute
overimpute(res.amelia, var="maxO3")

Comment on the quality of the imputation.

We can also examine the variability by projecting as supplementary tables the imputed data sets on the PCA configuration (plot the results of MI with PCA).

plot(res.MIPCA, choice = "ind.supp")
plot(res.MIPCA, choice = "var")

We now fit a regression model on each imputed data set obtained with the amelia method.

resamelia <- lapply(res.amelia$imputations, as.data.frame)
# a regression on each imputed data set
fitamelia <- lapply(resamelia, lm,
                    formula = "maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
# equivalently:
# fitamelia <- lapply(resamelia, with,
#                     lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))

We do the same with the imputed datasets of the MIPCA method.
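
A possible sketch; we assume here that the imputed data sets produced by MIPCA are stored in the res.MI component of res.MIPCA.

# assumption: res.MIPCA$res.MI is the list of imputed data sets
fitMIPCA <- lapply(res.MIPCA$res.MI, lm,
                   formula = "maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")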

The package mice (Multiple Imputation by Chained Equations) makes it possible to aggregate the results of the models fitted on each imputed data set according to Rubin's rules.

library(mice)
# ?mice
# pool is a function from mice to aggregate the results according to Rubin's rule
# ?pool
poolamelia <- pool(as.mira(fitamelia))
summary(poolamelia)
poolMIPCA <- pool(as.mira(fitMIPCA))
summary(poolMIPCA)

We could write a function that removes the variable with the largest p-value step by step (each time a variable is removed, the regression model is fitted again) until all remaining variables are significant.

don2 <- don
reg <- lm(maxO3 ~ ., data = don2)
# drop the variable with the largest p-value until all are significant
while (any(summary(reg)$coeff[-1, 4] > 0.05)) {
  don2 <- don2[, !(colnames(don2) %in% names(which.max(summary(reg)$coeff[-1, 4])))]
  reg <- lm(maxO3 ~ ., data = don2)
}
reg
  1. Interpret the results. What should be different from the same regression analysis obtained on one completed dataset?

EM algorithm for bivariate normal data with missing values

The purpose of this problem is to use the EM algorithm to estimate the mean of a bivariate normal dataset with missing entries in one of the two variables. We first generate synthetic data and then implement the EM algorithm to compute the estimator of the mean.

library(mvtnorm)

We consider a bivariate normal random variable \(y=\begin{pmatrix} y_1\\ y_2 \end{pmatrix}\) and denote the mean vector and covariance matrix of its distribution \(\mu=\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}\) and \(\Sigma=\begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22} \end{pmatrix}\): \(y\sim \mathcal{N}(\mu, \Sigma)\). We observe a sample of size \(n\) that contains some missing values in the variable \(y_2\): for some \(r\leq n\), we observe \((y_{i1},y_{i2})\) for \(i=1,\dots,r\) and only \(y_{i1}\) for \(i=r+1,\dots,n\). The goal is to estimate the mean \(\mu\). We will compare two strategies: 1) direct computation of the maximum likelihood estimator and 2) estimation of the mean with the EM algorithm.

Data generation

(R1) Generate a bivariate normal sample of size \(100\) of mean \(\begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix}=\begin{pmatrix} 5\\ -1 \end{pmatrix}\) and covariance matrix \(\begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{12} & \sigma_{22} \end{pmatrix}=\begin{pmatrix} 1.3 & 0.4\\ 0.4 & 0.9 \end{pmatrix}\) containing 30% of missing values in the variable \(y_2\).
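
A possible sketch; the seed and the completely-at-random missingness pattern are choices made here for illustration.

library(mvtnorm)
set.seed(1)                                     # arbitrary seed, for reproducibility
n     <- 100
mu    <- c(5, -1)
Sigma <- matrix(c(1.3, 0.4, 0.4, 0.9), nrow = 2)
y     <- rmvnorm(n, mean = mu, sigma = Sigma)
miss  <- sample(n, size = 0.3 * n)              # indices of the 30% missing entries in y2
y[miss, 2] <- NA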

Maximum likelihood estimator

We denote by \(f_{1,2}(y_1,y_2;\mu, \Sigma)\), \(f_1(y_1;\mu_1, \sigma_{11})\) and \(f_{2|1}(y_2|y_1; \mu, \Sigma)\) the probability density functions of the joint \((y_1,y_2)\), of \(y_1\), and of \(y_2|y_1\) respectively. The likelihood of the observed data can be written as \[L(\mu, \Sigma)=\prod_{i=1}^nf_1(y_{i1})\prod_{i=1}^rf_{2|1}(y_{i2}|y_{i1}),\] and the log-likelihood is (up to an additive constant that plays no role in the maximization and that we therefore drop)

\[l(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\sigma_{11})-\frac{1}{2}\sum_{i=1}^n\frac{(y_{i1}-\mu_1)^2}{\sigma_{11}}-\frac{r}{2}\log\left(\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}\right)\] \[-\frac{1}{2}\sum_{i=1}^r\frac{\left(y_{i2}-\mu_2-\frac{\sigma_{12}}{\sigma_{11}}(y_{i1}-\mu_1)\right)^2}{\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}}\]

We skip the computations and directly give the closed-form expression of the maximum likelihood estimator of the mean: \[ \hat{\mu}_1=n^{-1}\sum_{i=1}^ny_{i1}, \] \[ \hat{\mu}_2=\hat{\beta}_{20.1}+\hat{\beta}_{21.1}\hat{\mu}_1,\]

where \(\hat{\beta}_{21.1}=s_{12}/s_{11}\), \(\hat{\beta}_{20.1}=\bar{y}_2-\hat{\beta}_{21.1}\bar{y}_1\), with \(\bar{y}_j=r^{-1}\sum_{i=1}^ry_{ij}\) for \(j=1,2\), and \(s_{jk}=r^{-1}\sum_{i=1}^r(y_{ij}-\bar{y}_j)(y_{ik}-\bar{y}_k)\) for \(j,k=1,2\).

(R2) Compute the maximum likelihood estimates of \(\mu_1\) and \(\mu_2\).
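
A possible sketch, reusing the sample y generated above; the complete pairs are identified by an indicator rather than assuming they come first in the sample.

obs      <- !is.na(y[, 2])                     # complete pairs
mu1.hat  <- mean(y[, 1])                       # uses all n observations of y1
ybar1    <- mean(y[obs, 1])
ybar2    <- mean(y[obs, 2])
s11      <- mean((y[obs, 1] - ybar1)^2)
s12      <- mean((y[obs, 1] - ybar1) * (y[obs, 2] - ybar2))
beta21.1 <- s12 / s11
beta20.1 <- ybar2 - beta21.1 * ybar1
mu2.hat  <- beta20.1 + beta21.1 * mu1.hat
c(mu1.hat, mu2.hat)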

EM algorithm

In this simple setting, we have an explicit expression of the maximum likelihood estimator despite the missing values. This is not always the case; the EM algorithm makes it possible to obtain the maximum likelihood estimates in cases where data are missing.

The EM algorithm consists in maximizing the “observed likelihood” \[l(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\sigma_{11})-\frac{1}{2}\sum_{i=1}^n\frac{(y_{i1}-\mu_1)^2}{\sigma_{11}}-\frac{r}{2}\log\left(\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}\right)\] \[-\frac{1}{2}\sum_{i=1}^r\frac{\left(y_{i2}-\mu_2-\frac{\sigma_{12}}{\sigma_{11}}(y_{i1}-\mu_1)\right)^2}{\sigma_{22}-\frac{\sigma_{12}^2}{\sigma_{11}}},\] through successive maximizations of the “complete likelihood” (the one we would have if we had observed all \(n\) realizations of \(y_1\) and \(y_2\)). Maximizing the complete likelihood \[l_c(\mu, \Sigma|y_1,y_2)=-\frac{n}{2}\log(\det(\Sigma))-\frac{1}{2}\sum_{i=1}^n(y_{i}-\mu)^T\Sigma^{-1}(y_{i}-\mu)\]

would be straightforward if we had all the observations. However, some elements of this likelihood are not available. Consequently, we replace them by their conditional expectation given the observed data and the parameters of the current iteration. These two steps, computation of the conditional expectation (E step) and maximization of the completed likelihood (M step), are repeated until convergence.

The update formulas for the E and M steps are the following

E step:

The sufficient statistics of the likelihood are:

\[s_1=\sum_{i=1}^ny_{i1},\quad s_2=\sum_{i=1}^ny_{i2},\quad s_{11}=\sum_{i=1}^ny_{i1}^2,\quad s_{22}=\sum_{i=1}^ny_{i2}^2,\quad s_{12}=\sum_{i=1}^ny_{i1}y_{i2}.\]

Since some values of \(y_2\) are not available, we fill in the sufficient statistics with:

\[E[y_{i2}|y_{i1},\mu,\Sigma]=\beta_{20.1}+\beta_{21.1}y_{i1}\] \[E[y_{i2}^2|y_{i1},\mu,\Sigma]=(\beta_{20.1}+\beta_{21.1}y_{i1})^2+\sigma_{22.1}\] \[E[y_{i1}y_{i2}|y_{i1},\mu,\Sigma]=(\beta_{20.1}+\beta_{21.1}y_{i1})y_{i1}.\]

M step:

The M step consists in computing the maximum likelihood estimates as usual. Given \(s_1, s_2, s_{11}, s_{22}, \text{and } s_{12}\), update \(\hat{\mu}\) and \(\hat{\Sigma}\) with \[\hat{\mu}_1=s_1/n,\quad \hat{\mu}_2=s_2/n,\] \[\hat{\sigma}_{11}=s_{11}/n-\hat{\mu}_1^2,\quad \hat{\sigma}_{22}=s_{22}/n-\hat{\mu}_2^2,\quad \hat{\sigma}_{12}=s_{12}/n-\hat{\mu}_1\hat{\mu}_2\]

Note that \(s_1\), \(s_{11}\), \(\hat{\mu}_1\) and \(\hat{\sigma}_{11}\) are constant across iterations since there are no missing values on \(y_1\).

(R3) Write two functions called Estep and Mstep that respectively perform the E step and the M step. The Estep function can take \(\mu\) and \(\Sigma\) as inputs. From them, you can compute \(\beta_{21.1}=\sigma_{12}/\sigma_{11}\), \(\beta_{20.1}=\mu_2-\beta_{21.1}\mu_1\) and \(\sigma_{22.1}=\sigma_{22}-\sigma^2_{12}/\sigma_{11}\), and update the sufficient statistics \(s_{ij}\). The Mstep function consists in updating \(\mu\) and \(\Sigma\) given the \(s_{ij}\).
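
A possible sketch of the two functions, assuming y is the n x 2 matrix generated above, with NAs in its second column.

Estep <- function(y, mu, Sigma) {
  miss <- is.na(y[, 2])
  beta21.1  <- Sigma[1, 2] / Sigma[1, 1]
  beta20.1  <- mu[2] - beta21.1 * mu[1]
  sigma22.1 <- Sigma[2, 2] - Sigma[1, 2]^2 / Sigma[1, 1]
  # replace the unavailable terms by their conditional expectations
  ey2  <- ifelse(miss, beta20.1 + beta21.1 * y[, 1], y[, 2])
  ey22 <- ifelse(miss, (beta20.1 + beta21.1 * y[, 1])^2 + sigma22.1, y[, 2]^2)
  ey12 <- ifelse(miss, (beta20.1 + beta21.1 * y[, 1]) * y[, 1], y[, 1] * y[, 2])
  list(s1 = sum(y[, 1]), s2 = sum(ey2), s11 = sum(y[, 1]^2),
       s22 = sum(ey22), s12 = sum(ey12), n = nrow(y))
}

Mstep <- function(s) {
  mu      <- c(s$s1, s$s2) / s$n
  sigma11 <- s$s11 / s$n - mu[1]^2
  sigma22 <- s$s22 / s$n - mu[2]^2
  sigma12 <- s$s12 / s$n - mu[1] * mu[2]
  list(mu = mu, Sigma = matrix(c(sigma11, sigma12, sigma12, sigma22), nrow = 2))
}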

(Q1) How could we initialize the algorithm?

(R4) Implement a function called initEM that returns initial values for \(\hat{\mu}\) and \(\hat{\Sigma}\).
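
One natural initialization, given here as an assumption among others: impute the missing values of \(y_2\) by the mean of its observed values, then take the empirical moments of the resulting matrix.

initEM <- function(y) {
  y0 <- y
  y0[is.na(y0[, 2]), 2] <- mean(y0[, 2], na.rm = TRUE)   # mean imputation of y2
  list(mu = colMeans(y0), Sigma = cov(y0) * (nrow(y0) - 1) / nrow(y0))
}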

(R5) Implement the EM algorithm over 15 iterations and plot the value of \(\left\|\mu-\hat{\mu}\right\|^2\) over iterations. Comment on your results briefly.
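
A possible sketch, chaining the functions above (mu is the true mean used for the simulation):

theta <- initEM(y)
err   <- numeric(15)
for (k in 1:15) {
  s      <- Estep(y, theta$mu, theta$Sigma)
  theta  <- Mstep(s)
  err[k] <- sum((theta$mu - mu)^2)               # squared distance to the true mean
}
plot(err, type = "b", xlab = "iteration",
     ylab = expression(paste("||", mu - hat(mu), "||")^2))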

(R6) Check that the EM estimator \(\hat{\mu}\) is equal to the maximum likelihood estimator.
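
For instance (sketch, using the quantities computed above):

theta$mu                 # EM estimate after 15 iterations
c(mu1.hat, mu2.hat)      # closed-form maximum likelihood estimate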