Issue with using foreach - r

I have a question about running a parallel simulation with foreach, and I'd really appreciate your help; I have spent hours and hours on this and keep getting errors.
I'm using foreach() for a simulation study where, in each iteration, I run a function that itself contains some "regular" loops. Run sequentially, it works perfectly fine. On a Linux server it also works fine when I run it with foreach in parallel inside an interactive R session, but as soon as I run it in batch mode with "nohup R CMD BATCH", it starts returning errors. I set the seed manually for each simulation, so whether I run in parallel inside R, in parallel in batch mode, or sequentially inside R, the code operates on the same simulated data; if one mode produces no errors, the others shouldn't either, yet strangely that is not the case. Has anyone run into the same problem?
I've spent hours adding exception handling, but nothing helped.
Here is my code:
dataSim <- function(seedNum, n, mi, beta0FE, beta1, beta0RE){
  # Setting the seed number:
  set.seed(seedNum)
  # Generating covariates:
  x <- rnorm(n, mean = 0, sd = 1)
  # Generating Y:
  data <- data.frame(id = rep(1:n, each = mi), x = rep(x, each = mi),
                     beta0_RE = rep(beta0RE, each = mi))
  etaTmp <- data$beta0_RE + beta0FE + beta1*data$x
  piTmp <- exp(etaTmp)/(1 + exp(etaTmp))
  data$Y <- sapply(piTmp, rbinom, n = 1, size = 1)
  return(data)
}
# Data Simulation Parameters:
n <- 100
mi <- 30
beta0FE <- 0.3
beta1 <- 1
beta0RE <- rnorm(n, 0, 1)
# Simulation Parameters:
nSim <- 10000
nIter <- 4000
library(MASS)  # glmmPQL() comes from MASS (and uses nlme)
LME_Freq_Fun <- function(data){
  fit.LME <- try(glmmPQL(Y ~ x, random = ~ 1 | id,
                         family = binomial, data = data))
  print(class(fit.LME)[1])
  if (class(fit.LME)[1] == "try-error"){
    return(list(coef.FE.LME = rep(NA, 3),
                coef.RE.LME = rep(NA, n)))
  }else{
    return(list(coef.FE.LME = fit.LME$coefficients$fixed,
                coef.RE.LME = fit.LME$coefficients$random$id))
  }
}
library(foreach)
# a parallel backend (e.g. doParallel/doMC) must be loaded and registered for %dopar%
LME_Freq_RSLT <- foreach (i = 1:nSim) %dopar% {
  print(paste("i=", i))
  print("-------")
  data <- dataSim(i, n, mi, beta0FE, beta1, beta0RE)
  LME_Freq_Fun(data)
}
# Extracting elements from foreach:
Coef.Fixed.LME <- t(sapply(1:nSim, function(i)
  return(LME_Freq_RSLT[[i]][[1]])))
Coef.Rand.LME <- t(sapply(1:nSim, function(i)
  return(LME_Freq_RSLT[[i]][[2]])))
With no "try", I keep getting:
task 3527 failed - "$ operator is invalid for atomic vectors"
With "try", my Coef.Fixed.LME becomes very strange, with every element showing as "numeric.2"!
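One thing worth checking (a sketch, not a confirmed fix): the error branch returns rep(NA, 3) while a successful fit has only two fixed effects (intercept and x), so sapply() cannot simplify the results into a matrix, which would explain the "numeric.2" elements. Returning error-path objects whose lengths match the success path, and optionally using foreach's .errorhandling argument, keeps the result structure consistent:

LME_Freq_Fun <- function(data){
  fit.LME <- try(glmmPQL(Y ~ x, random = ~ 1 | id,
                         family = binomial, data = data), silent = TRUE)
  if (inherits(fit.LME, "try-error")){
    # lengths match the success branch: 2 fixed effects, n random intercepts
    return(list(coef.FE.LME = rep(NA, 2),
                coef.RE.LME = rep(NA, n)))
  } else {
    return(list(coef.FE.LME = fit.LME$coefficients$fixed,
                coef.RE.LME = fit.LME$coefficients$random$id))
  }
}
# .errorhandling = "pass" returns the error object for a failed task
# instead of aborting the whole foreach loop
LME_Freq_RSLT <- foreach(i = 1:nSim, .errorhandling = "pass",
                         .packages = "MASS") %dopar% {
  data <- dataSim(i, n, mi, beta0FE, beta1, beta0RE)
  LME_Freq_Fun(data)
}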

Related

parallelizing lapply with parLapply does not recognize objects even though I supplied them

I am trying to parallelize something with parLapply. I am exporting all necessary information to the cores, but somehow I get an error saying that the object 'market_time' cannot be found (it is used in the first line of the function called by parLapply). However, this object is just a column of the data table 'dt' that I export to the cores.
library('data.table')
library('numDeriv')
library('snow')
library('parallel')  # detectCores() comes from parallel, not snow
cores <- detectCores()
cl <- makeCluster(cores[1], type = 'PSOCK')
markets <- unique(dt[, market_time])
R <- 10000
nu_p  <- rnorm(n = R, -2, 0.5)
nu_xr <- rnorm(n = R, 2, 0.5)
nu_xm <- rnorm(n = R, 2, 0.5)
nu_xj <- rnorm(n = R, 2, 0.5)
clusterExport(cl, c('dt','nu_p','nu_xr','nu_xm','nu_xj'))
temp <- parLapply(cl, markets, calc_mc_w, dt = dt, nu_p = nu_p, nu_xr = nu_xr,
                  nu_xm = nu_xm, nu_xj = nu_xj)
where the function calc_mc_w is:
calc_mc_w <- function(m, dt, nu_p, nu_xr, nu_xm, nu_xj){
  dt_mkt <- dt[market_time == m, ]
  market_time <- dt_mkt[, market_time]
  x_m <- dt_mkt[, x_m]
  x_j <- dt_mkt[, x_j]
  x_r <- dt_mkt[, x_r]
  p <- as.matrix(dt_mkt[, p])
  xi <- dt_mkt[, xi]
  p <- as.matrix(dt_mkt[, p])
  jacobian <- jacobian(function(x){calc_shares(x, x_m, x_j, x_r, xi, nu_p,
                                               nu_xm, nu_xj, nu_xr,
                                               market_time)}, p)
  output <- dt_mkt[, c('prod','market','time','retailer')]
  # Get a system of equations with as many equations as unknowns
  retailers <- unique(dt_mkt[, retailer])
  temp <- lapply(retailers, calc_mc_w_r, dt_mkt = dt_mkt, jacobian = jacobian)
  temp <- rbindlist(temp)
  output <- merge(output, temp, by.x = c('prod','retailer'),
                  by.y = c('prod','retailer'), allow.cartesian = TRUE)
  output
}
calc_mc_w_r <- function(r, dt_mkt, jacobian){
  dt_r <- dt_mkt[retailer == r, ]
  result <- dt_r[, c('prod','retailer')]
  rows <- (dt_mkt[, 'retailer'] == r)
  jacobian_r <- jacobian[rows, rows]
  result <- result[, mc_w := solve(jacobian_r, dt_r[, shares] + jacobian_r %*% dt_r[, p])]
  result
}
The error I get is:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: object 'market_time' not found
If, instead of exporting the data table dt, I export each of its columns, I get the same error but for 'jacobian', which is something I calculate inside the function (I do not want to calculate it across the whole dataset, as that is very costly; that is why I only want to calculate it on each subset).
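One possibility worth checking (an assumption, not a confirmed diagnosis): clusterExport() copies objects to the workers but does not load packages there, so data.table's column scoping and numDeriv's jacobian() may simply not be available on the nodes. Loading the packages on every worker with clusterEvalQ(), and also exporting the helper functions the workers need, is a cheap thing to try:

library(parallel)
# make data.table and numDeriv available on each worker before calling parLapply
clusterEvalQ(cl, {
  library(data.table)
  library(numDeriv)
})
# export the helpers used inside calc_mc_w as well
clusterExport(cl, c('dt', 'nu_p', 'nu_xr', 'nu_xm', 'nu_xj',
                    'calc_mc_w_r', 'calc_shares'))
temp <- parLapply(cl, markets, calc_mc_w, dt = dt, nu_p = nu_p, nu_xr = nu_xr,
                  nu_xm = nu_xm, nu_xj = nu_xj)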

lme4 allFit() giving confusing results when wrapped in a function

I am using allFit() in lme4 to automatically scan through possible optimizers, since the default optimizer usually doesn't converge in this situation. My code works fine when I run it line-by-line, but when I run it wrapped in a simple function, it gives different results.
I've looked at the output of the allFit call and it seems that when it's NOT inside the function, it returns a list of lmerModLmerTest objects as desired.
However, inside the function, it returns a list with the values simpleError, error, and condition. Why is it doing this?
I'm using RStudio, R 3.6, lme4 1.1-21, lmerTest 3.1-0.
UPDATE: The problem is that the update() method used by allFit cannot find the 'tt' data frame when re-fitting the models. I have put breakpoints into the code, and the 'tt' data seems to exist in the function environment, so I don't understand why it can't be found...
UPDATE 2: It appears that if I change the assignment of the test data to <<-, it works. This is dangerous, though, since it breaks functional programming, and I suspect it may fail when I try to parallelize. I am testing further... still open to suggestions!
Here is the code that works, when not wrapped in a function:
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0, .3)){
  sig_b2 <- -1*iccs / (iccs - 1)
  n <- nsub * nclust * narm
  y <- rep_len(NA, n)
  arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
  clustid <- rep(1:(nclust*narm), each = nsub)
  clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
  clustRE <- rep(clustRElist, each = nsub)
  sig_b2 <- rep(sig_b2, each = nclust*nsub)
  error <- rnorm(n, mean = 0, sd = 1)
  beta <- rep(betas, each = nclust*nsub)
  linpred <- beta + clustRE + error
  output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
  return(output)
}
set.seed(2)
test_1 <- multi_arm_var_sim()
model_flex_1 <- lmer(linpred ~ arm + (1 + arm | clustid),
                     data = test_1)
diff_optims_1 <- allFit(model_flex_1, verbose = TRUE)
print(class(diff_optims_1[[1]]))
is.OK_1 <- sapply(diff_optims_1, is, "lmerMod")
print(is.OK_1)
And here is the code that doesn't work, same setup, wrapped in a function.
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0, .3)){
  sig_b2 <- -1*iccs / (iccs - 1)
  n <- nsub * nclust * narm
  y <- rep_len(NA, n)
  arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
  clustid <- rep(1:(nclust*narm), each = nsub)
  clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
  clustRE <- rep(clustRElist, each = nsub)
  sig_b2 <- rep(sig_b2, each = nclust*nsub)
  error <- rnorm(n, mean = 0, sd = 1)
  beta <- rep(betas, each = nclust*nsub)
  linpred <- beta + clustRE + error
  output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
  return(output)
}
get_pval <- function(){
  tt <- multi_arm_var_sim()
  model_flex <- lme4::lmer(linpred ~ arm + (1 + arm | clustid),
                           data = tt)
  diff_optims <- lme4::allFit(model_flex, data = tt, verbose = TRUE)
  print(class(diff_optims[[1]]))
  is.OK <- sapply(diff_optims, is, "merMod")
  print(is.OK)
}
set.seed(2)
get_pval()
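For reference, the global-assignment workaround mentioned in UPDATE 2 would look roughly like the sketch below. Presumably it works because update(), called internally by allFit, can then find the data in the global environment, but assigning to the global environment from inside a function is generally discouraged:

get_pval <- function(){
  tt <<- multi_arm_var_sim()   # <<- puts tt in the global environment
  model_flex <- lme4::lmer(linpred ~ arm + (1 + arm | clustid), data = tt)
  diff_optims <- lme4::allFit(model_flex, data = tt, verbose = TRUE)
  sapply(diff_optims, is, "merMod")
}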
Thanks!!

How to make this R code (for loop) more efficient?

I am doing a simulation study and I wrote the following R code. Is there any way to write this code without using two for loops, or otherwise make it more efficient (run faster)?
S = 10000
n = 100
v = c(5,10,50,100)
beta0.mle = matrix(NA,S,length(v)) # four S-by-length(v) matrices of NA
beta1.mle = matrix(NA,S,length(v))
beta0.lse = matrix(NA,S,length(v))
beta1.lse = matrix(NA,S,length(v))
for (j in 1:length(v)){
  for (i in 1:S){
    set.seed(i)
    beta0 = 50
    beta1 = 10
    x = rnorm(n)
    e.t = rt(n,v[j])
    y.t = e.t + beta0 + beta1*x
    func1 = function(betas){
      beta0 = betas[1]
      beta1 = betas[2]
      sum = sum(log(1+1/v[j]*(y.t-beta0-beta1*x)^2))
      return((v[j]+1)/2*sum)
    }
    beta0.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[1]
    beta1.mle[i,j] = nlm(func1,c(1,1),iterlim = 1000)$estimate[2]
    beta0.lse[i,j] = lm(y.t~x)$coef[1]
    beta1.lse[i,j] = lm(y.t~x)$coef[2]
  }
}
The function func1 inside the inner for loop is used by nlm() (to find the MLE when the errors are t-distributed).
I wanted to use the parallel package in R, but I didn't find any functions that seemed useful here.
The key to getting anything to run faster in R is replacing for loops with vectorized functions (such as the apply family). Additionally, as in any programming language, you should look for places where you call expensive functions (such as nlm) more than once with the same parameters, and store the results rather than recomputing them each time.
Here I start as you did, by defining the parameters. Since beta0 and beta1 are always 50 and 10, I define those here as well.
S <- 10000
n <- 100
v <- c(5,10,50,100)
beta0 <- 50
beta1 <- 10
Next we define func1 outside the loop to avoid redefining it each time. func1 now takes three extra parameters, v, y.t and x, so that it can be called with the new values.
func1 <- function(betas, v, y.t, x){
  beta0 <- betas[1]
  beta1 <- betas[2]
  sum <- sum(log(1+1/v*(y.t-beta0-beta1*x)^2))
  return((v+1)/2*sum)
}
Now we actually do the real work. Rather than having nested loops, we use nested apply statements. The outer lapply makes a list entry for each value of v, and the inner vapply builds a matrix of the four values you want (beta0.mle, beta1.mle, beta0.lse, beta1.lse), one column for each of the S iterations.
values <- lapply(v, function(j) vapply(1:S, function(s) {
  # This should look familiar, it is taken from your code
  set.seed(s)
  x <- rnorm(n)
  e.t <- rt(n, j)
  y.t <- e.t + beta0 + beta1*x
  # Rather than running `nlm` and `lm` twice, we run each once and store the results
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t ~ x)
  # now we return the four values of interest
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)) # this tells `vapply` what to expect out of the function
)
Finally we can reorganize everything into the four matrices.
beta0.mle <- vapply(values, function(x) x["beta0.mle", ], numeric(S))
beta1.mle <- vapply(values, function(x) x["beta1.mle", ], numeric(S))
beta0.lse <- vapply(values, function(x) x["beta0.lse.(Intercept)", ], numeric(S))
beta1.lse <- vapply(values, function(x) x["beta1.lse.x", ], numeric(S))
As a final note, it may be possible to reorganize this to run even faster, depending on why you are using the S index to set the seed. If it is important to know which seed was used to generate your x with rnorm, then this may be the best I can do. However, if you are only doing it to ensure that all of your values of v are tested on the same values of x, there may be more reorganizing we can do, and more speed-up to gain, using replicate.
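Since the question mentions the parallel package: the outer lapply above could, in principle, be swapped for parallel::mclapply. This is only a sketch under that assumption; mclapply relies on forking, so on Windows you would use parLapply with a PSOCK cluster instead, and mc.cores is a placeholder to adjust for your machine.

library(parallel)
# same computation as above, with the outer loop over v run in parallel
values <- mclapply(v, function(j) vapply(1:S, function(s) {
  set.seed(s)
  x <- rnorm(n)
  e.t <- rt(n, j)
  y.t <- e.t + beta0 + beta1*x
  nlmmod <- nlm(func1, c(1,1), j, y.t, x, iterlim = 1000)
  lmmod <- lm(y.t ~ x)
  c(beta0.mle = nlmmod$estimate[1],
    beta1.mle = nlmmod$estimate[2],
    beta0.lse = lmmod$coef[1],
    beta1.lse = lmmod$coef[2])
}, numeric(4)), mc.cores = 2)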

R: How to add jitter only on singular matrices within a function?

I have the following function that I need to (m)apply on a list of more than 1500 large matrices (Z) and a list of vectors (p) of the same length. However, I get an error that some matrices are singular, as I already posted here. Here is my function:
kastner <- function(item, p) { print(item)
  imp <- rowSums(Z[[item]])
  exp <- colSums(Z[[item]])
  x = p + imp
  ac = p + imp - exp
  einsdurchx = 1/as.vector(x)
  einsdurchx[is.infinite(einsdurchx)] <- 0
  A = Z[[item]] %*% diag(einsdurchx)
  R = solve(diag(length(p))-A) %*% diag(p)
  C = ac * einsdurchx
  R_bar = diag(as.vector(C)) %*% R
  rR_bar = round(R_bar)
  return(rR_bar)
}
and my mapply command that also prints the names of the running matrix:
KASTNER <- mapply(kastner, names(Z), p, SIMPLIFY = FALSE)
In order to overcome the singularity problem, I want to add a small amount of jitter to the singular matrices. The problem starts in line 9 of the function, R = solve(diag(length(p))-A) %*% diag(p), because the term (diag(length(p))-A) becomes singular and can't be solved. I tried to add jitter to all Z matrices in the first line of the function using Z <- lapply(Z, function(x) jitter(x, factor = 0.0001, amount = NULL)), but this amount is very, very small and still produces errors.
Therefore my idea is to check, with if/else or something similar, whether the matrix diag(length(p))-A is singular (maybe using eigenvectors to check collinearity), add jitter only to those matrices, and otherwise run the solve command as it is. Any ideas how to implement this in the function? Thanks
Here is some example data, although it has no singularity problem; I was not able to reproduce the error from line 9 with it:
Z <- list("111.2012"= matrix(c(0,0,100,200,0,0,0,0,50,350,0,50,50,200,200,0),
nrow = 4, ncol = 4, byrow = T),
"112.2012"= matrix(c(10,90,0,30,10,90,0,10,200,50,10,350,150,100,200,10),
nrow = 4, ncol = 4, byrow = T))
p <- list("111.2012"=c(200, 1000, 100, 10), "112.2012"=c(300, 900, 50, 100))
Edit: a small amount of jitter shouldn't be a problem in my data, since probably more than 80% of the entries in my matrices are zeros and the rest are large values. I am only interested in those large values, but the large number of zeros is probably the reason for the singularity; they are nevertheless needed.
Since you didn't provide a working example I couldn't test this easily, so the burden of proof is on you. :) In any case, it should be a starting point for further tinkering. Comments in the code.
kastner <- function(item, p) { print(item)
  imp <- rowSums(Z[[item]])
  exp <- colSums(Z[[item]])
  x = p + imp
  ac = p + imp - exp
  einsdurchx = 1/as.vector(x)
  einsdurchx[is.infinite(einsdurchx)] <- 0
  # start a chunk that repeats until you get a valid result
  do.jitter <- TRUE # bureaucracy
  while (do.jitter == TRUE) {
    # run the code as usual
    A = Z[[item]] %*% diag(einsdurchx)
    # catch any possible errors, you can even catch "singularity" error here by
    # specifying error = function(e) e
    R <- tryCatch(solve(diag(length(p))-A) %*% diag(p), error = function(e) "jitterme")
    # if you were able to solve(), and the result is a matrix (careful if it's a vector!)...
    if (is.matrix(R)) {
      # ... turn the while loop off
      do.jitter <- FALSE
    } else {
      # ... else apply some jitter and repeat by constructing A from a jittered Z[[item]]
      Z[[item]] <- jitter(Z[[item]])
    }
  }
  C = ac * einsdurchx
  R_bar = diag(as.vector(C)) %*% R
  rR_bar = round(R_bar)
  return(rR_bar)
}
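If you prefer the explicit check the question asks about, rather than catching the error, one option is to test the reciprocal condition number with base R's rcond() before calling solve(). This is only a sketch of the chunk that would replace the tryCatch/while block inside the function, and the 1e-12 threshold is an arbitrary assumption:

A <- Z[[item]] %*% diag(einsdurchx)
M <- diag(length(p)) - A
# rcond() values near zero indicate a numerically singular matrix
while (rcond(M) < 1e-12) {
  Z[[item]] <- jitter(Z[[item]])              # perturb and rebuild before retrying
  A <- Z[[item]] %*% diag(einsdurchx)
  M <- diag(length(p)) - A
}
R <- solve(M) %*% diag(p)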

loop in r until value converges and stores all the outputs

I would like to repeat a process until a condition is met, while storing the outcomes.
Here is a simple case where I know the number of cycles to perform in the loop:
# just example data
smpls <- rnorm(100,50,50)
ncycles <- 1000
outm <- matrix(nrow = ncycles, ncol = 1)
# repeat the process for n cycles
for(i in 1:ncycles){
  outm[i] <- mean(sample(smpls, 50))
}
# get average of outm
outm <- mean(outm)
But my case is different, in the sense that I do not know ncycles. I want to continue sampling until the samples get very low variance, i.e. converge (I guess this calls for a "while" loop), for example until vsd is less than 1 in the following case.
vsd <- NULL
outm <- mean(sample(smpls, 50))
while (vsd > 1){
  outm[i] <- mean(sample(smpls, 50))
  vsd <- sd(outm)
}
I do not know what value of i to set here. Help appreciated.
Edits:
smpls <- rnorm(100,50,50)
iter <- 0
# maximum iteration
itermax <- 1000
outm <- rep(NA, itermax)
vsd <- 2
while((vsd > 1) && (iter < itermax)) {
  outm[iter] <- mean(sample(smpls, 50))
  vsd <- sd(outm)
  iter <- iter + 1
}
Error in while ((vsd > 1) && (iter < itermax)) { :
missing value where TRUE/FALSE needed
The main idea of stopping at convergence is to save time. Although the above example with just the mean function is quick, my original function needs significant time per iteration and I want to stop it when it converges.
Two problems in your code:
1) you need sd(..., na.rm = TRUE)
2) you need to be sure that there are at least two non-NA numbers in outm, so that sd(outm, na.rm = TRUE) is not NA
Just by the way, given the sd you specify to rnorm, I don't think you'll ever need more than a couple of dozen iterations.
sim <- function() {
  smpls <- rnorm(100,50,5)
  itermax <- 1000
  outm <- rep(NA, itermax)
  outm[1] <- mean(sample(smpls, 50))
  iter <- 1
  vsd <- 2
  while((vsd > 1) && (iter < itermax)) {
    iter <- iter + 1
    outm[iter] <- mean(sample(smpls, 50))
    vsd <- sd(outm, na.rm = TRUE)
  }
  iter
}
set.seed(666)
iters <- replicate(100000, sim() )
range(iters) # c(2, 11)
Cheers.
Here is a solution:
data
set.seed(123) # so that you can replicate what I did
smpls <- rnorm(100,50,50)
I think you need some initialization cycles (a minimum number of iterations) so that you do not get false convergence from a small number of samples; so first run a few samples, say miniter. You also need a maximum number of iterations so that your loop does not run wild, say maxiter.
meanconverge <- function(data, miniter, maxiter, tolerance){
  outm <- rep(NA, maxiter)
  for(i in 1:miniter){
    outm[i] <- mean(sample(data, 50))
  }
  # sd of initial cycles
  vsd <- sd(outm, na.rm = TRUE)
  if(vsd > tolerance) {
    iter <- miniter + 1
    sdout <- rep(NA, maxiter)
    while((vsd > tolerance) && (iter < maxiter)) {
      iter <- iter + 1
      outm[iter] <- mean(sample(data, 50))
      vsd <- sd(outm, na.rm = TRUE)
      sdout[iter] <- vsd
    }
    out <- list(outm, sdout)
    return(out)
  } else {
    return(outm)
  }
}
out <- meanconverge (data = smpls, miniter = 50, maxiter = 100000, tolerance = 3)
plot(unlist(out[2]), pch = ".", col = "red")
plot(unlist(out[1]), pch = ".", col = "red")
Checking for convergence is a tricky thing. A great way to get started is to look at how the value changes while you are computing. Convergence is all about getting arbitrarily close to a boundary; programmatically, you have to make a choice about what "arbitrary" means. You also need to decide how you will measure convergence.
To illustrate, suppose I want to know if my estimates for meeting my condition are getting really close to each other. I may have something like:
# inside my function or method that performs this convergence feat
i <- 1
while (while_condition && i < itermax) {
  i <- i + 1
  outcomes[i] <- some_complicated_foo(bar)
  if ( abs(outcomes[i-1] - outcomes[i]) <= tolerance ) {
    while_condition <- FALSE # i.e. STOP LOOPING
    return(outcomes)
  }
  # else: keep looping
}
Where tolerance is your definition of arbitrary closeness. Now, this seems like a hammer for your nail, right? Well, what happens if you converge to the wrong answer? How will you know? Does this thing even converge? The trick with these kinds of problems is making clever guesses about your functions or the data-generating process you are analyzing. But having a maximum-iteration boundary will definitely save some computing time, as long as it is reasonable. The real way to know if you are right is to use tests (like statistical tests or unit tests) to see if there is any 'garbage-in-garbage-out', or whether you get something different than you'd expect from a contrived example with a well-known answer.
Check out optimization algorithms and see how they do it. See ?optim or some other optimization package to see how the pros do it.
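As a rough illustration of that suggestion (a sketch on a toy problem, not tied to the question above): optim() takes a maximum iteration count via its control argument and reports whether it converged through the $convergence code, where 0 means success.

# minimize a simple quadratic; maxit caps the iterations
fit <- optim(par = c(0, 0),
             fn = function(b) sum((b - c(3, -1))^2),
             method = "BFGS",
             control = list(maxit = 1000))
fit$convergence  # 0 indicates successful convergence
fit$par          # parameter estimates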
