First of all you will need to install the following packages
install.packages("VIM")
install.packages("naniar")
install.packages("missMDA")
install.packages("Amelia")
install.packages("mice")
install.packages("missForest")
install.packages("FactoMineR")
Air pollution is currently one of the most serious public health worries worldwide. Many epidemiological studies have proved the influence that some chemical compounds, such as sulphur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3) can have on our health. Associations set up to monitor air quality are active all over the world to measure the concentration of these pollutants. They also keep a record of meteorological conditions such as temperature, cloud cover, wind, etc.
We have at our disposal 112 observations collected during the summer of 2001 in Rennes. The variables available are
Here the final aim is to analyse the relationship between the maximum daily ozone (maxO3) level and the other meteorological variables. To do so we will perform regression to explain maxO3 in function of all the other variables. This data is incomplete (there are missing values). Indeed, it occurs frenquently to have machines that fail one day, leading to some information not recorded. We will therefore perform regression with missing values via multiple imputation.
ozo <- read.table("ozoneNA.csv", header = TRUE,
sep=",", row.names = 1)
WindDirection <- ozo[, 12]
don <- ozo[, 1:11] #### keep the continuous variables
summary(don)
## maxO3 T9 T12 T15
## Min. : 42.00 Min. :11.30 Min. :14.30 Min. :14.90
## 1st Qu.: 71.00 1st Qu.:16.00 1st Qu.:18.60 1st Qu.:18.90
## Median : 81.50 Median :17.70 Median :20.40 Median :21.40
## Mean : 91.24 Mean :18.22 Mean :21.46 Mean :22.41
## 3rd Qu.:108.25 3rd Qu.:19.90 3rd Qu.:23.60 3rd Qu.:25.65
## Max. :166.00 Max. :25.30 Max. :33.50 Max. :35.50
## NA's :16 NA's :37 NA's :33 NA's :37
## Ne9 Ne12 Ne15 Vx9
## Min. :0.000 Min. :0.000 Min. :0.00 Min. :-7.8785
## 1st Qu.:3.000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:-3.0000
## Median :5.000 Median :5.000 Median :5.00 Median :-0.8671
## Mean :4.987 Mean :4.986 Mean :4.60 Mean :-1.0958
## 3rd Qu.:7.000 3rd Qu.:7.000 3rd Qu.:6.25 3rd Qu.: 0.6919
## Max. :8.000 Max. :8.000 Max. :8.00 Max. : 5.1962
## NA's :34 NA's :42 NA's :32 NA's :18
## Vx12 Vx15 maxO3v
## Min. :-7.8785 Min. :-9.000 Min. : 42.00
## 1st Qu.:-3.6941 1st Qu.:-3.759 1st Qu.: 70.00
## Median :-1.9284 Median :-1.710 Median : 82.50
## Mean :-1.6853 Mean :-1.830 Mean : 89.39
## 3rd Qu.:-0.1302 3rd Qu.: 0.000 3rd Qu.:101.00
## Max. : 6.5778 Max. : 3.830 Max. :166.00
## NA's :10 NA's :21 NA's :12
head(don)
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9 Vx12 Vx15 maxO3v
## 20010601 87 15.6 18.5 NA 4 4 8 0.6946 -1.7101 -0.6946 84
## 20010602 82 NA NA NA 5 5 7 -4.3301 -4.0000 -3.0000 87
## 20010603 92 15.3 17.6 19.5 2 NA NA 2.9544 NA 0.5209 82
## 20010604 114 16.2 19.7 NA 1 1 0 NA 0.3473 -0.1736 92
## 20010605 94 NA 20.5 20.4 NA NA NA -0.5000 -2.9544 -4.3301 114
## 20010606 80 17.7 19.8 18.3 6 NA 7 -5.6382 -5.0000 -6.0000 94
dim(don)
## [1] 112 11
library(VIM)
library(FactoMineR)
library(missMDA)
Q1 When could it be a good idea to delete rows or columns with missing values to work with a complete data set? Could you do it here?
dim(na.omit(don))
## [1] 13 11
Deleting rows or columns is possible as long as there is enough data left and the missing values are of the MCAR type so that the sample is a subsample of the original data. We will obtain unbiased estimators but with more variance. Deleting observations with missing data for ozone data leads to a table with 13 rows.
First, we perfom some descriptive statistics (how many missing? how many variables, individuals with missing?) and try to inspect and vizualize the pattern of missing entries and get hints on the mechanism. For this purpose, we use the R package VIM (Visualization and Imputation of Missing Values - Mathias Templ) as well as Multiple Correspondence Analysis (FactoMineR package). You should install the package VIM, then you can check the documentation by executing
?VIM
The other package that can be used is the package naniar developped by Nick Tierney’s and which is based on ggplot. We will detail this package in a future version of the handout.
library(naniar)
gg_miss_var(don)
The function VIM aggr calculates and represents the number of missing entries in each variable and for certain combinations of variables (which tend to be missing simultaneously).
res<-summary(aggr(don, sortVar=TRUE))$combinations
##
## Variables sorted by number of missings:
## Variable Count
## Ne12 0.37500000
## T9 0.33035714
## T15 0.33035714
## Ne9 0.30357143
## T12 0.29464286
## Ne15 0.28571429
## Vx15 0.18750000
## Vx9 0.16071429
## maxO3 0.14285714
## maxO3v 0.10714286
## Vx12 0.08928571
head(res[rev(order(res[,2])),])
## Combinations Count Percent
## 1 0:0:0:0:0:0:0:0:0:0:0 13 11.607143
## 45 0:1:1:1:0:0:0:0:0:0:0 7 6.250000
## 10 0:0:0:0:0:1:0:0:0:0:0 5 4.464286
## 35 0:1:0:0:0:0:0:0:0:0:0 4 3.571429
## 41 0:1:0:0:1:1:1:0:0:0:0 3 2.678571
## 28 0:0:1:0:0:0:0:0:0:0:0 3 2.678571
We can see that the combination which is the most frequent is the one where all the variables are observed (13 values). Then, the second one is the one where T9, T12 and T15 are simultaneously missing (7 rows) (1 is missing, 0 is observed - there is a 1 for the second, third and fourth variables). The graph on the right panel represents the pattern, with blue for observed and red for missing.
The VIM function matrixplot creates a matrix plot in which all cells of a data matrix are visualized by rectangles. Available data is coded according to a continuous color scheme (gray scale), while missing/imputed data is visualized by a clearly distinguishable color (red). If you use Rstudio the plot is not interactive (there are the warnings), but if you use R directly, you can click on a column of your choice: the rows are sorted (decreasing order) of the values of this column. This is useful to check if there is an association between the value of a variable and the missingness of another one.
matrixplot(don, sortby = 2)
##
## Click in a column to sort by the corresponding variable.
## To regain use of the VIM GUI and the R console, click outside the plot region.
#Here the variable selected is variable 2.
Q2 Do you observe any associations between the missing entries ? When values are missing on a variable does it correspond to small or large values on another one ?
We observe that the temperature variables T9, T12 and T15 tend to be missing together (probably indicating that thermometers failed) [as well as the Ne9, Ne12 and Ne15 variables.] We see more “red” values. We do not see more black or white values which should imply that when T9 is missing it would have corresponded to high or low values in another variable which should suggest MAR missing values for instance. Here everything points to MCAR values.
The VIM function marginplot creates a scatterplot with additional information on the missing values. If you plot the variables (x,y), the points with no missing values are represented as in a standard scatterplot. The points for which x (resp. y) is missing are represented in red along the y (resp. x) axis. In addition, boxplots of the x and y variables are represented along the axes with and without missing values (in red all variables x where y is missing, in blue all variables x where y is observed).
marginplot(don[,c("T9","maxO3")])
We can see that the distribution of T9 is the same when maxO3 is oberved and when maxO3 is missing. If the two boxplots (red and blue) would have been very different it would imply that when maxO3 is missing the values of T9 can be very high or very low which lead to suspect the MAR hypothesis.
R1 Create a categorical dataset with “o” when the value of the cell is observed and “m” when it is missing, and with the same row and column names as in the original data. Then, you can perform Multiple Correspondence Analysis with the MCA function of the FactoMineR package.
?MCA
MCA can be seen as the counterpart of PCA for categorical data and here is used to study associations between missing and observed entries. MCA is a straightforwardly tool to visualise the missing data pattern even if the number of variable is large. It shows if missing values simultaneously occur in several variables or if missing values occur when some other variables are observed
data_miss <- data.frame(is.na(don))
data_miss <- apply(X=data_miss, FUN=function(x) if(x) "m" else "o", MARGIN=c(1,2))
res.mca <- MCA(data_miss, graph = F)
plot(res.mca, invis = "ind", title = "MCA graph of the categories", cex = 0.5)
Then, before modeling the data, we perform a PCA with missing values to explore the correlation between variables. Use the R package missMDA dedicated to perform principal components methods with missing values and to impute data with PC methods.
library(missMDA)
?estim_ncpPCA
?imputePCA
The package missMDA allows the use of principal component methods for an incomplete data set. To achieve this goal in the case of PCA, the missing values are predicted using the iterative PCA algorithm for a predefined number of dimensions. Then, PCA is performed on the imputed data set. The single imputation step requires tuning the number of dimensions used to impute the data.
nb <- estim_ncpPCA(don,method.cv = "Kfold", verbose = FALSE) # estimate the number of components from incomplete data
#(available methods include GCV to approximate CV)
nb$ncp #2
## [1] 2
plot(0:5, nb$criterion, xlab = "nb dim", ylab = "MSEP")
res.comp <- imputePCA(don, ncp = nb$ncp) # iterativePCA algorithm
res.comp$completeObs[1:3,] # the imputed data set
## maxO3 T9 T12 T15 Ne9 Ne12 Ne15 Vx9
## 20010601 87 15.6000 18.50000 20.47146 4 4.000000 8.000000 0.6946
## 20010602 82 18.5047 20.86986 21.79932 5 5.000000 7.000000 -4.3301
## 20010603 92 15.3000 17.60000 19.50000 2 3.984066 3.812104 2.9544
## Vx12 Vx15 maxO3v
## 20010601 -1.710100 -0.6946 84
## 20010602 -4.000000 -3.0000 87
## 20010603 1.950563 0.5209 82
imp <- cbind.data.frame(res.comp$completeObs,WindDirection)
res.pca <- PCA(imp, quanti.sup = 1, quali.sup = 12, ncp = nb$ncp, graph=FALSE)
plot(res.pca, hab=12, lab="quali");
plot(res.pca, choix="var")
head(res.pca$ind$coord) #scores (principal components)
## Dim.1 Dim.2
## 20010601 -0.6604580 -1.2048271
## 20010602 -1.2317545 1.0465411
## 20010603 0.7984643 -2.7299508
## 20010604 2.5423205 -1.7435774
## 20010605 -0.4047517 0.8406578
## 20010606 -2.6701824 1.6934864
The incomplete data set can be imputed using the function imputePCA performing the iterative PCA algorithm, specifying the number of dimensions through the argument ncp=2. At convergence the algorithm provides both an estimation of the scores and loadings as well as a completed data set. The imputePCA function outputs the imputed data set. The completed data set is in the object completeObs. The imputePCA function also outputs the fitted matrix \(\hat X\) in the object fitted.
Q3 Could you guess how cross-validation is performed to select the number of components?
The cross-validation is performed with the Kfold methodFor the Kfold. A percentage pNA of missing values is inserted and predicted with a PCA model using ncp.min to ncp.max dimensions. This process is repeated nbsim times. The number of components which leads to the smallest MSEP (Mean Standard Error of Prediction) is retained.
Through the argument method.cv, the function estim_ncpPCA proposes several cross-validation procedures to choose this number. The default method is the generalised cross-validation method (method.cv=“gcv”). It consists in searching the number of dimensions which minimises the generalised cross-validation criterion, which can be seen as an approximation of the leave-one-out cross-validation criterion. The procedure is very fast, because it does not require adding explicitly missing values and predicting them for each cell of the data set. However, the number of dimensions minimising the criterion can sometimes be unobvious when several local minimum occur. In such a case, more computationally intensive methods, those performing explicit cross-validation, can be used, such as Kfold (method.cv=“Kfold”) or leave-one-out (method.cv=“loo”).
The Kfold cross-validation suggests to retain 2 dimensions for the imputation of the data set.
We perform multiple imputation either assuming 1) Joint Modeling (one joint probabilistic model for the variables all together) - We use the R package Amelia, which is by default consider Gaussian distribution 2) Condional Modeling (one model per variable) approach - We use the R package mice which by default consider one model of linear regression per variable 3) a PCA based model - We use the R package missMDA
For each approach we generate 100 imputed data sets.
library(Amelia)
?amelia
res.amelia <- amelia(don, m = 5)
## -- Imputation 1 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78
##
## -- Imputation 2 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
##
## -- Imputation 3 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 141 142 143 144 145 146 147 148 149 150
##
## -- Imputation 4 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76
##
## -- Imputation 5 --
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
## 161 162 163
#names(res.amelia$imputations)
#res.amelia$imputations$imp1# the first imputed data set
library(mice)
imp.mice <- mice(don, m = 100, defaultMethod = "norm.boot") # the variability of the parameters is obtained
?MIPCA
?plot.MIPCA
res.MIPCA <- MIPCA(don, ncp = 2, nboot = 100) # MI with PCA using 2 dimensions
The function MIPCA gives as output the data set imputed by the iterative PCA algorithm (in res.imputePCA) and the other data sets generated by the MIPCA algorithm (in res.MI). The number of data sets generated by this algorithm is controlled by the nboot argument, equal to 100 by default. The other arguments of this function are the same as those for the imputePCA function.
Exploratory analysis is very important and even at this stage of the analysis.
We will inspect the imputed values created to know if the imputation method should require more investigation or if we can continue and analyze the data. A common practice consists in comparing the distribution of the imputed values and of the observed values. Check the compare.density function.
compare.density(res.amelia, var = "T12")
Q Do both distributions need to be close? Could the missing values differ from the observed ones both in spread and in location?
Note that a difference between these distributions does not mean that the model is unsuitable. Indeed, when the missing data mechanism is not MCAR, it could make sense to observe differences between the distribution of imputed values and the distribution of observed values. However, if differences occur, more investigations would be required to try to explain them.
The quality of imputation can also be assessed with cross-validation using the overimpute function. Each observed value is deleted and for each one 100 values are predicted (using the same MI method) and the mean and 90% confidence intervals are computed for these 100 values. Then, we inspect whether the observed value falls within the obtained interval. On the graph, the y=x line is plotted (where the imputations should fall if they were perfect), as well as the mean (dots) and intervals (lines) for each value. Around ninety percent of these confidence intervals should contain the y = x line, which means that the true observed value falls within this range. The color of the line (as coded in the legend) represents the fraction of missing observations in the pattern of missingness for that observation (ex: blue=0-2 missing entries).
overimpute(res.amelia, var = "maxO3")
We can also examine the variability by projecting as supplementary tables the imputed data sets on the PCA configuration (plot the results of MI with PCA).
plot(res.MIPCA,choice= "ind.supp")
plot(res.MIPCA,choice= "var")
The plots represent the projection of the individuals (top) and variables (bottom) of each imputed data set as supplementary elements onto the reference configuration obtained with the iterative PCA algorithm. For the individuals, a confidence area is constructed for each, and if one has no missing entries, its confidence area is restricted to a point. All the plots show that the variability across different imputations is small and a user can interpret the PCA results with confidence.
MI aims to apply a statistical method on an incomplete data set. We now apply a regression model on each imputed data set of the amelia method and MIPCA methods.
resamelia <- lapply(res.amelia$imputations, as.data.frame)
# A regression on each imputed dataset
fitamelia<-lapply(resamelia, lm,
formula="maxO3~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
# fitamelia <- lapply(resamelia, with,
# lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
imp.mice <- mice(don, m=100,defaultMethod="norm.boot") # the variability of the parameters is obtained
lm.mice.out <- with(imp.mice, lm(maxO3 ~ T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v))
res.MIPCA <- lapply(res.MIPCA$res.MI, as.data.frame)
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~T9+T12+T15+Ne9+Ne12+Ne15+Vx9+Vx12+Vx15+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia)
## est se t df Pr(>|t|)
## (Intercept) 1.2967211 47.7189035 0.02717416 2.291917 0.98050540
## T9 -3.4816239 9.0823339 -0.38334023 1.332799 0.75364562
## T12 8.2938004 6.4052436 1.29484543 1.981350 0.32573776
## T15 -2.0491882 2.7103986 -0.75604679 3.359014 0.49914252
## Ne9 -4.7502763 1.9105804 -2.48630012 3.964180 0.06831647
## Ne12 4.0891969 6.0883850 0.67163902 2.316358 0.56259768
## Ne15 -3.4427456 3.6138362 -0.95265680 3.014175 0.41074569
## Vx9 -2.7109677 6.4367322 -0.42117143 1.369208 0.73022368
## Vx12 3.2499194 3.8828048 0.83700303 2.783555 0.46834239
## Vx15 -1.1424535 1.7734345 -0.64420393 4.607524 0.55012776
## maxO3v 0.4597928 0.3001937 1.53165368 2.199070 0.25426826
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -180.8889745 183.4824166 NA 0.9700671 0.9518813
## T9 -68.8408095 61.8775616 NA 0.9893447 0.9802094
## T12 -19.5158311 36.1034319 NA 0.9778572 0.9630029
## T15 -10.1760810 6.0777046 NA 0.9286767 0.8959521
## Ne9 -10.0738549 0.5733024 NA 0.8965483 0.8548690
## Ne12 -18.9602231 27.1386169 NA 0.9693788 0.9509120
## Ne15 -14.9130488 8.0275576 NA 0.9445721 0.9169561
## Vx9 -47.0687993 41.6468638 NA 0.9888405 0.9794201
## Vx12 -9.6686601 16.1684989 NA 0.9538970 0.9295268
## Vx15 -5.8205238 3.5356168 NA 0.8598203 0.8098233
## maxO3v -0.7260121 1.6455978 NA 0.9725782 0.9554346
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
## est se t df Pr(>|t|)
## (Intercept) 12.0139412 18.52351188 0.6485779 61.59027 0.519022205
## T9 1.0906067 1.25093726 0.8718317 50.63007 0.387415853
## T12 1.6409326 0.98456860 1.6666514 58.14971 0.100961873
## T15 0.7125345 0.92587062 0.7695832 50.65107 0.445121411
## Ne9 -1.2333654 1.10536835 -1.1157959 66.34800 0.268534505
## Ne12 -1.5098382 1.63165356 -0.9253424 46.12283 0.359604288
## Ne15 0.2541235 1.18787207 0.2139317 59.49925 0.831331731
## Vx9 0.7513106 1.09329358 0.6871993 64.59788 0.494416995
## Vx12 1.0794458 1.22299952 0.8826216 60.45067 0.380937177
## Vx15 0.3236662 1.22141216 0.2649934 55.64118 0.791993774
## maxO3v 0.2542852 0.08687859 2.9269035 77.03304 0.004498089
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -25.01893199 49.0468145 NA 0.3530306 0.3323575
## T9 -1.42120021 3.6024137 NA 0.4554088 0.4343129
## T12 -0.32978757 3.6116527 NA 0.3843762 0.3635604
## T15 -1.14653926 2.5716083 NA 0.4552049 0.4341097
## Ne9 -3.44008849 0.9733577 NA 0.3106287 0.2901568
## Ne12 -4.79395049 1.7742740 NA 0.4999502 0.4787270
## Ne15 -2.12238576 2.6306328 NA 0.3720045 0.3512441
## Vx9 -1.43240517 2.9350263 NA 0.3261137 0.3055677
## Vx12 -1.36654240 3.5254340 NA 0.3633430 0.3426221
## Vx15 -2.12346461 2.7707970 NA 0.4076584 0.3867428
## maxO3v 0.08128906 0.4272814 NA 0.2179428 0.1978985
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
don2 <- don
reg <- lm(maxO3 ~. , data = don2)
while(any(summary(reg)$coeff[-1, 4]>0.05)){
don2 <- don2[,!(colnames(don2)%in%names(which.max(summary(reg)$coeff[-1, 4])))]
reg <- lm(maxO3 ~. , data = don2)
}
We combine the results and performed the regression with missing values
# Submodel to compare
fitMIPCA<-lapply(res.MIPCA,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolMIPCA<-pool(as.mira(fitMIPCA))
summary(poolMIPCA)
## est se t df Pr(>|t|)
## (Intercept) 10.3527704 14.05117837 0.7367902 68.20607 4.637768e-01
## T12 2.9418851 0.60586289 4.8556945 71.05469 6.897628e-06
## Ne9 -1.9236770 1.01793403 -1.8897855 68.37455 6.302666e-02
## Vx12 1.8474080 0.74518463 2.4791278 70.65831 1.555880e-02
## maxO3v 0.3294905 0.08035748 4.1003087 79.80010 9.862783e-05
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -17.6843857 38.3899265 NA 0.3254811 0.3059881
## T12 1.7338450 4.1499253 NA 0.3019111 0.2825344
## Ne9 -3.9547313 0.1073773 NA 0.3240788 0.3045927
## Vx12 0.3614275 3.3333884 NA 0.3051739 0.2857812
## maxO3v 0.1695678 0.4894131 NA 0.2309122 0.2118753
#lm.mice.out <- with(imp.mice, lm(maxO3 ~ T12+Ne9+Vx12+maxO3v))
#pool.mice <- pool(lm.mice.out)
#summary(pool.mice)
fitamelia<-lapply(resamelia,lm, formula="maxO3~ T12+Ne9+Vx12+maxO3v")
poolamelia<-pool(as.mira(fitamelia))
summary(poolamelia)
## est se t df Pr(>|t|)
## (Intercept) 16.1861693 9.17614482 1.763940 73.270679 8.191046e-02
## T12 3.0604401 0.46579504 6.570358 35.020664 1.375614e-07
## Ne9 -3.8459887 0.97459756 -3.946233 9.212779 3.222745e-03
## Vx12 1.0204257 0.65245773 1.563972 21.769876 1.322453e-01
## maxO3v 0.3346974 0.06964355 4.805864 25.759979 5.735396e-05
## lo 95 hi 95 nmis fmi lambda
## (Intercept) -2.1007229 34.4730614 NA 0.1322825 0.1089161
## T12 2.1148458 4.0060343 NA 0.2910342 0.2516699
## Ne9 -6.0429439 -1.6490335 NA 0.6530857 0.5851484
## Vx12 -0.3335186 2.3743700 NA 0.4055659 0.3533535
## maxO3v 0.1914781 0.4779167 NA 0.3632500 0.3156602
Ecolo <- read.csv("ecological.csv", header = TRUE, sep=";",dec=",")
## Delete species with only missing values for contiuous variables
ind <- which(rowSums(is.na(Ecolo[,-1])) == 6)
biome <- Ecolo[-ind,1] ### Keep a categorical variable
Ecolo <- Ecolo[-ind,-1] ### Select continuous variables
dim(Ecolo)
## [1] 2494 6
## proportion of missing values
sum(is.na(Ecolo))/(nrow(Ecolo)*ncol(Ecolo)) # 55% of missing values
## [1] 0.5338145
## Delete species with missing values
dim(na.omit(Ecolo)) # only 72 remaining species!
## [1] 72 6
# Visualize the pattern
library(VIM)
#aggr(Ecolo)
aggr(Ecolo,only.miss=TRUE,numbers=TRUE,sortVar=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Rmass 0.89013633
## LL 0.69967923
## Pmass 0.69847634
## Amass 0.69125902
## Nmass 0.17361668
## LMA 0.04971933
res <- summary(aggr(Ecolo,prop=TRUE,combined=TRUE))$combinations
#res[rev(order(res[,2])),]
mis.ind <- matrix("o",nrow=nrow(Ecolo),ncol=ncol(Ecolo))
mis.ind[is.na(Ecolo)] <- "m"
dimnames(mis.ind) <- dimnames(Ecolo)
library(FactoMineR)
resMCA <- MCA(mis.ind)
plot(resMCA,invis="ind",title="MCA graph of the categories")
### Impute the incomplete data set
library(missMDA)
### nb <- estim_ncpPCA(Ecolo,method.cv="Kfold",nbsim=100) ### Time consuming!
res.comp <- imputePCA(Ecolo,ncp=2)
#Perform a PCA on the completed data set
imp <- cbind.data.frame(res.comp$completeObs,biome)
res.pca <- PCA(imp,quali.sup=7,graph=FALSE)
plot(res.pca, hab=7, lab="quali")
plot(res.pca, hab=7, lab="quali",invisible="ind")
plot(res.pca, choix="var")