Simultaneous imputation of multiple binary variables in R

I have a dataset with multiple correlated binary variables (0/1). Can anyone point me towards a way to impute values that are missing completely at random, using the information in the other variables?
Below, I provide some code to create a simplified dataset with just 3 correlated binary variables.
# create correlated random binary (0/1) variables
x1 <- runif(100, 0, 1)              # uniform on (0,1)
x2 <- x1 * runif(100, 0, 1)         # correlated with x1
x3 <- x2 * runif(100, 0, 1) + 0.2   # correlated with x2
x1 <- round(x1)
x2 <- round(x2)
x3 <- round(x3)
# introduce missing values completely at random (MCAR)
x1[seq(1, 100, 7)] <- NA
x2[seq(2, 100, 7)] <- NA
x3[seq(3, 100, 7)] <- NA
# how can I impute missing values in this dataframe?
df <- as.data.frame(cbind(x1,x2,x3))
cor(df,use="pairwise.complete.obs")
Thanks so much,
Micha

You could use the mice package.
> library(mice)
Loading required package: Rcpp
mice 2.21 2014-02-05
> df.imputed <- complete(mice(df))
# mice output deleted
> nrow(df) == sum(complete.cases(df.imputed))
[1] TRUE
> cor(df.imputed)
          x1        x2        x3
x1 1.0000000 0.4645345 0.2914986
x2 0.4645345 1.0000000 0.6787420
x3 0.2914986 0.6787420 1.0000000
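By default mice imputes numeric columns with predictive mean matching, which here already yields 0/1 values because the donors are observed rows. If you prefer an explicitly binary imputation model, a minimal sketch (the names df.fac, imp and df.bin are just illustrative) is to convert the columns to factors so that mice uses logistic regression:
library(mice)
df.fac <- as.data.frame(lapply(df, factor))      # treat the 0/1 columns as categorical
imp <- mice(df.fac, method = "logreg", m = 5, seed = 1)
df.bin <- complete(imp)                          # first completed data set
colSums(is.na(df.bin))                           # should be all zeros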

Related

How can I solve multicollinearity?

I constructed a linear model and tried to calculate the VIF of the variables, but I get the following error:
vif(lm_model3101)
Error in vif.default(lm_model3101) :
there are aliased coefficients in the model
To check which numeric variables are correlated, I calculated the correlation matrix of the numeric variables used, and there is no perfect or near-perfect correlation between any of them:
cor(multi)
                        mydata..CRU.Index. mydata..GDP.per.capita. mydata.price_per_unit mydata.price_discount mydata..AC..Volume.
mydata..CRU.Index.             1.000000000             0.006036169             0.1646463          -0.097077238         -0.006590327
mydata..GDP.per.capita.        0.006036169             1.000000000             0.1526220           0.008135387         -0.137733119
mydata.price_per_unit          0.164646319             0.152621974             1.0000000          -0.100344865         -0.310770525
mydata.price_discount         -0.097077238             0.008135387            -0.1003449           1.000000000          0.339961760
mydata..AC..Volume.           -0.006590327            -0.137733119            -0.3107705           0.339961760          1.000000000
What could the problem be? Any help or suggestions? The rest of our explanatory variables are factors, so they cannot be correlated in this sense.
Having aliased coefficients doesn't necessarily mean two predictors are perfectly correlated. It means that they are linearly dependent, that is, at least one term is a linear combination of the others. They could be factors or continuous variables. To find them, use the alias function. For example:
y <- runif(10)
x1 <- runif(10)
x2 <- runif(10)
x3 <- x1 + x2
alias(y~x1+x2+x3)
Model :
y ~ x1 + x2 + x3

Complete :
   (Intercept) x1 x2
x3           0  1  1
This identifies x3 as being the sum of x1 and x2.
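Applied to the question itself, a hedged follow-up sketch (lm_model3101 is the asker's model object; the dropped term name x_dup is hypothetical) is to read the aliased coefficient off alias(), refit without it, and then run vif() again:
library(car)                                  # for vif()
alias(lm_model3101)$Complete                  # row names are the aliased terms
# suppose the output flags a predictor called x_dup:
lm_model3101_reduced <- update(lm_model3101, . ~ . - x_dup)
vif(lm_model3101_reduced)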

System is computationally singular due to small numbers in linearHypothesis

OK, so here is the code that demonstrates the problem I am referring to:
library(car)                              # for hccm() and linearHypothesis()
x1 <- c(0.001, 0.002, 0.003, 0.0003)      # very small values
x2 <- c(15000893, 23034340, 3034300, 232332242)  # very large values
x3 <- c(1, 3, 5, 6)
y <- rnorm(4)
model  <- lm(y ~ x1 + x2)
model2 <- lm(y ~ x1 + x3)
type <- "hc0"
V <- hccm(model, type = type)             # heteroscedasticity-consistent covariance
sumry <- summary(model)
table <- coef(sumry)
table[, 2] <- sqrt(diag(V))               # replace the SEs with robust SEs
table[, 3] <- table[, 1] / table[, 2]
table[, 4] <- 2 * pt(abs(table[, 3]), df.residual(model), lower.tail = FALSE)
sumry$coefficients <- table
p <- nrow(table)
hyp <- cbind(0, diag(p - 1))              # hypothesis: all slopes jointly equal to 0
linearHypothesis(model, hyp, white.adjust = type)
Note that this is not caused by perfect multicollinearity.
As you can see, I deliberately set the values of x2 to be very large and the values of x1 to be very small. When this happens, I cannot perform a linearHypothesis test of model <- lm(y ~ x1 + x2) with all coefficients equal to 0 via linearHypothesis(model, hyp, white.adjust=type). R throws the following error:
> linearHypothesis(model, hyp, white.adjust=type)
Error in solve.default(vcov.hyp) :
system is computationally singular: reciprocal condition number = 2.31795e-23
However, when I use model2 <- lm(y ~ x1 + x3) instead, where x3 is not too large compared to x1, the linearHypothesis test succeeds:
> linearHypothesis(model2, hyp, white.adjust=type)
Linear hypothesis test
Hypothesis:
x1 = 0
x3 = 0
Model 1: restricted model
Model 2: y ~ x1 + x3
Note: Coefficient covariance matrix supplied.
  Res.Df Df      F Pr(>F)
1      3
2      1  2 11.596 0.2033
I am aware that this might be caused by the fact that R cannot reliably invert a matrix once its reciprocal condition number falls below a certain threshold, in this case 2.31795e-23. However, is there a way to circumvent that? Is this a limitation of R or of the underlying linear-algebra routines?
What is good practice here? The only method I can think of is to rescale the variables so that they are on the same scale, but I am also concerned about how much information I lose by dividing everything by its standard error.
In fact, I have 200 variables that are percentages and 10 variables (including the dependent variable) that are large (potentially on the 10^6 scale), so scaling them one by one would be tedious.
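As a minimal sketch of the rescaling the question mentions (assuming the variables sit together in a data frame, here called dat, a hypothetical name), every numeric column can be standardised in one step rather than one by one; standardising changes only the units of the coefficients, not the fitted values or, in exact arithmetic, the joint test:
dat <- data.frame(y, x1, x2)                 # stand-in for the real data
num <- sapply(dat, is.numeric)
dat[num] <- scale(dat[num])                  # centre each column and divide by its SD
model.scaled <- lm(y ~ x1 + x2, data = dat)
linearHypothesis(model.scaled, hyp, white.adjust = type)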

What to use in plsr and plsrglm to choose optimal number of components

I am using partial least squares (PLS) regression in R, via the packages pls and plsRglm. I generate a data frame as below and use the two packages to perform PLS.
I chose the optimal number of components with RMSEP in package pls, and with BIC in package plsRglm. Below is the R script.
x1 <- as.numeric(round(runif(10,-40,40),2))
x2 <- as.numeric(round(x1*1.4+60,2))
x3 <- as.numeric(round(runif(10,20,60),2))
x4 <- as.numeric(round(x2*0.9+60,2))
x5 <- as.numeric(round(x2*x3*0.9+60,2))
x6 <- as.numeric(round(x2*x3*x4*x5/1000000,2))
y <- as.numeric(round(runif(10,50,150),2))
df <- data.frame(y,x1,x2,x3,x4,x5,x6)
library(pls)
# plsr, RMSEP
mod.plsr <- plsr(y~x1+x2+x3+x4+x5+x6, data=df,
ncomp=5, validation="CV")
## delta vector contains successive CV RMSEP differences (5 components were fitted)
err.CV <- c()
for (i in 1:5) { err.CV[i] <- RMSEP(mod.plsr)$val[i*2 + 1] }  # CV estimate for i components
delta <- err.CV[1:4] - err.CV[2:5]
comp.plsr <- min(which(delta < 0.05))
plot(RMSEP(mod.plsr),legendpos="topright", main="")
## mixed model regression coefficients
mod.plsr.opt = plsr(y~x1+x2+x3+x4+x5+x6, data=df,
ncomp = comp.plsr)
coef(mod.plsr.opt)
, , 1 comps

              y
x1 4.324635e-05
x2 6.054166e-05
x3 3.218208e-05
x4 5.449111e-05
x5 4.142277e-03
x6 4.653091e-03
library(plsRglm)
# plsrglm, BIC
mod.plsrglm = plsRglm(y~x1+x2+x3+x4+x5+x6, data=df,
nt=5, model="pls")
# use BIC to determine optimal number of components
comp.plsrglm = which(mod.plsrglm$InfCrit[,2] == min(mod.plsrglm$InfCrit[,2]))-1
# refit model and extract beta coefficients from the optimal model
mod.plsrglm.opt = plsRglm(y~x1+x2+x3+x4+x5+x6, data=df,
nt=comp.plsrglm, model="pls")
mod.plsrglm.opt$Coeffs
                   [,1]
Intercept -4.422569e+05
x1        -3.150225e+03
x2        -2.355536e+03
x3         4.523422e+00
x4         5.120661e+03
x5        -1.490321e-01
x6         7.920704e-02
I have several questions on these two packages.
1) Can I produce RMSEP in plsRglm, and can I plot it as I did with plsr?
2) Should I use AIC or BIC in plsRglm to determine the optimal number of components?
3) Why are the two packages giving quite different results, and why does plsRglm report an intercept coefficient while plsr does not?
Thank you.
For 3): try
coef(mod.plsr.opt, intercept = TRUE)
to get the intercept from the 'pls' package.
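As a hedged follow-up sketch (not part of the original answer), with the intercept included the two coefficient vectors can be lined up side by side; the remaining differences come from the different numbers of components and estimation details, not from the intercept:
cbind(pls     = drop(coef(mod.plsr.opt, intercept = TRUE)),
      plsRglm = drop(mod.plsrglm.opt$Coeffs))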

subset data frame according to a criteria based on correlation

I have a data frame and would like to create a function that keeps only the variables with low correlation. That is, looking at the pairwise correlation of each variable with the rest, whenever at least one correlation coefficient is greater than 0.4, that variable and the one it is highly correlated with are removed from the data frame.
For example suppose I have a data frame:
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
cor(data, use="pairwise.complete.obs")
            x1         x2          x3         x4
x1  1.00000000 -0.3325757  0.08567911  0.2651721
x2 -0.33257569  1.0000000 -0.18761301  0.4660056
x3  0.08567911 -0.1876130  1.00000000 -0.3321003
x4  0.26517210  0.4660056 -0.33210031  1.0000000
I would then like to return a data frame keeping only x1 and x3 (since x2 and x4 have a correlation of 0.46).
Calculate a logical matrix cd flagging absolute correlations greater than 0.4.
Then subset away, ignoring the diagonal, where row == col:
cd <- abs(cor(data, use="pairwise.complete.obs")) > 0.4
data[-unique(col(cd)[cd & row(cd) != col(cd)])]
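One hedged caveat to add to this answer: if no off-diagonal correlation exceeds 0.4, the index vector above is empty and data[-integer(0)] drops every column, so a small guard may be worth keeping:
drop.idx <- unique(col(cd)[cd & row(cd) != col(cd)])
if (length(drop.idx) > 0) data[-drop.idx] else data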
You could try:
set.seed(50)
data <- data.frame(x1=rnorm(10), x2=rnorm(10), x3=runif(10), x4=runif(10,15,20))
mycor <- cor(data, use="pairwise.complete.obs")
# for each column of the correlation matrix, drop its largest entry (the diagonal 1)
# and flag the column if any remaining correlation is above 0.4 or below -0.4
data[, !apply(mycor, 2, function (x) max(x[-which.max(x)]) >.4 | min(x[which.min(x)]) < -.4) ]

matrix of correlations

I'm new to using R and I am trying to create a matrix of correlations. I have three independent variables (x1, x2, x3) and one dependent variable (y).
I've been trying to use cor to make a matrix of the correlations, but so far I have been unable to find a formula for doing this.
x1=rnorm(20)
x2=rnorm(20)
x3=rnorm(20)
y=rnorm(20)
data=cbind(y,x1,x2,x3)
cor(data)
If I have understood correctly, you have a matrix of 3 columns (say x1 to x3) and many rows (the y values). You could proceed as follows:
foo = matrix(runif(30), ncol=3) # creating a matrix of 3 columns
cor(foo)
If you already have your values in 3 vectors x1 to x3, you can build foo like this: foo <- data.frame(x1, x2, x3)
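A hedged usage sketch with the vectors from the question (foo follows the naming in this answer): cor() also accepts a second argument, which gives the correlations of each predictor with y directly:
foo <- data.frame(x1, x2, x3)
cor(foo)        # 3 x 3 correlation matrix of the predictors
cor(foo, y)     # correlation of each predictor with y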
Correct me if I'm wrong, but assuming this is related to a regression problem, this might be what you're looking for:
#Set the number of data points and build 3 independent variables
set.seed(0)
numdatpoi <- 7
x1 <- runif(numdatpoi)
x2 <- runif(numdatpoi)
x3 <- runif(numdatpoi)
#Build the dependent variable with some added noise
noisig <- 10
yact <- 2 + (3 * x1) + (5 * x2) + (10 * x3)
y <- yact + rnorm(n=numdatpoi, mean=0, sd=noisig)
#Fit a linear model
rmod <- lm(y ~ x1 + x2 + x3)
#Build the variance-covariance matrix. This matrix is typically what is wanted.
(vcv <- vcov(rmod))
#If needed, convert the variance-covariance matrix to a correlation matrix
(cm <- cov2cor(vcv))
From the above, here's the variance-covariance matrix:
            (Intercept)        x1        x2        x3
(Intercept)    466.5773   14.3368 -251.1715 -506.1587
x1              14.3368  452.9569 -170.5603 -307.7007
x2            -251.1715 -170.5603  387.2546  255.9756
x3            -506.1587 -307.7007  255.9756  873.6784
And, here's the associated correlation matrix:
            (Intercept)          x1         x2         x3
(Intercept)  1.00000000  0.03118617 -0.5908950 -0.7927735
x1           0.03118617  1.00000000 -0.4072406 -0.4891299
x2          -0.59089496 -0.40724064  1.0000000  0.4400728
x3          -0.79277352 -0.48912986  0.4400728  1.0000000
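One hedged clarification (not part of the original answer): the matrices above describe the covariances and correlations of the coefficient estimates, whereas the cor(data) call in the question describes correlations among the raw variables; the two answer different questions:
cor(cbind(y, x1, x2, x3))   # correlations among the observed variables
cov2cor(vcov(rmod))         # correlations among the estimated coefficients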
