There are similar questions on StackOverflow; however, I cannot find a solution identical (or even remotely similar) to my issue. I am running the following simulation in R:
iterSim <- 1000
for (i in 1:iterSim) {
  # Generate some random data
  Z <- NormalDGP(n=T, beta=coef_min, theta=Theta, rho=Rho, ...)
  y <- Z[1:(length(Z[,1])-1), 1]
  x <- Z[2:length(Z[,1]), 2]
  # Conduct the following tests! However, Dec_POS_Dep sometimes gives an error
  Dec_POS_Dep <- POS_Dep(y, x, simul=TRUE, trueBeta=coef_min, ...)
  Dec_POS_Fix <- POS_Fix(y, x, simul=TRUE, trueBeta=coef_min, ...)
  Dec_CD_95 <- CD_95(y, x, simul=TRUE)
}
where for each iteration i random numbers are generated and three tests are run, i.e., Dec_POS_Dep, Dec_POS_Fix and Dec_CD_95. Unfortunately, Dec_POS_Dep sometimes throws an error and the simulation terminates. I am not looking for the loop to skip an iteration when an error occurs (as per many suggestions on StackOverflow); instead, I would like that iteration to be repeated. E.g., if the code is on the 265th iteration and Dec_POS_Dep gives an error, I want it to take many more shots at the 265th iteration. A solution to this would be much appreciated.
Two things stand out as broken here:
Use of try (as MartinGal suggested) or tryCatch will allow things to continue. As for rerunning that iteration, you'll need to keep track of the failed runs somehow and rerun them yourself; there is no notion of telling R to repeat a for loop iteration.
You are discarding data on each iteration: Dec_CD_95 is overwritten each time. Perhaps you meant to keep things around?
Here's a suggestion:
iterSim <- 1000
out <- list()
while (length(out) < iterSim) {
  try({
    # Generate some random data
    Z <- NormalDGP(n=T, beta=coef_min, theta=Theta, rho=Rho, ...)
    y <- Z[1:(length(Z[,1])-1), 1]
    x <- Z[2:length(Z[,1]), 2]
    # Conduct the tests; if any of them errors, nothing is appended to `out`
    Dec_POS_Dep <- POS_Dep(y, x, simul=TRUE, trueBeta=coef_min, ...)
    Dec_POS_Fix <- POS_Fix(y, x, simul=TRUE, trueBeta=coef_min, ...)
    Dec_CD_95 <- CD_95(y, x, simul=TRUE)
    # Wrap the three results in one inner list so each successful
    # iteration adds exactly one element to `out`
    out <- c(out, list(list(Dec_POS_Dep, Dec_POS_Fix, Dec_CD_95)))
  }, silent = TRUE)
}
This is a little sloppy, admittedly, but it should always end up with 1000 iterations of your simulation.
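If you would rather keep an explicit for loop and retry iteration i in place, a tryCatch-based sketch along these lines should also work (the retry cap of 50 is arbitrary, and the ... placeholders are carried over from the question):
iterSim <- 1000
results <- vector("list", iterSim)
for (i in 1:iterSim) {
  for (attempt in 1:50) {    # arbitrary cap so a persistent failure cannot hang the loop
    res <- tryCatch({
      Z <- NormalDGP(n=T, beta=coef_min, theta=Theta, rho=Rho, ...)
      y <- Z[1:(length(Z[,1])-1), 1]
      x <- Z[2:length(Z[,1]), 2]
      list(POS_Dep(y, x, simul=TRUE, trueBeta=coef_min, ...),
           POS_Fix(y, x, simul=TRUE, trueBeta=coef_min, ...),
           CD_95(y, x, simul=TRUE))
    }, error = function(e) NULL)
    if (!is.null(res)) break   # success: stop retrying this iteration
  }
  results[[i]] <- res          # NULL only if all 50 attempts failed
}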
Related
I am assisting a colleague with adding functionality to one of his R packages.
I have implemented nonparametric bootstrapping using a for loop construct in R.
# Perform resampling: resample `subsample_size` values, with or without
# replacement, `replicate_size` times
for (i in 1:replicate_size) {
  if (replacement == TRUE) {  # bootstrapping
    z  <- sample(x, size = subsample_size, replace = TRUE)
    zz <- sample(x, size = subsample_size, replace = TRUE)
  } else {                    # subsampling
    z  <- sample(x, size = subsample_size, replace = FALSE)
    zz <- sample(x, size = subsample_size, replace = FALSE)
  }
  # Calculate the statistic
  boot_samples[i] <- min(zz) - max(z)
}
The above loop is nested within another for loop, which itself is nested within a function (details not shown). The code I'm dealing with is messy, and there are most certainly more efficient ways of coding things up, but I've had to leave it be since my colleague is only familiar with very basic and rudimentary coding constructs.
Upon running said function, I specified all required arguments (replicate_size, replacement) except subsample_size, which is needed to carry out the resampling. This mistake on my part was revealing because, for some strange reason, the code still ran without throwing an error about the missing value for subsample_size.
Question: Does anyone have any idea why this happens?
I'd include more code, but it is very verbose and unwieldy (his code, not mine). Running the for loop outside the function does raise the expected error about the missing value.
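For what it's worth, one mechanism that produces exactly this behaviour (hedged, since the full code isn't shown) is that R evaluates arguments lazily and propagates missingness, and sample() supplies its own default when size is missing. A minimal, standalone illustration (not the colleague's code):
f <- function(x, n) {
  # `n` has no default, yet calling f(1:5) does not error: the missing `n`
  # is passed through to sample(), where missing(size) is TRUE, so sample()
  # falls back to its own default, size = length(x)
  sample(x, size = n, replace = TRUE)
}
f(1:5)   # runs fine; returns 5 resampled values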
I have this part of my JAGS code, and I really can't see where it goes out of range. Can anyone spot the error I'm missing? These are the data sizes.
N = 96
L = c(4,4,4,4,4)
length(media1) = 96
length(weights1) = 4
for (t in 1:N) {
  current_window_x <- ifelse(t <= L[1], media1[1:t], media1[(t - L[1] + 1):t])
  t_in_window <- length(current_window_x)
  new_media1[t] <- ifelse(t <= L[1],
                          inprod(current_window_x, weights1[1:t_in_window]),
                          inprod(current_window_x, weights1))
}
The error is (where line 41 corresponds to the first line in the loop):
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Compilation error on line 41.
Index out of range taking subset of media1
I actually just happened onto the answer earlier today for something I was working on; the answer is in this post. The gist is that ifelse() in JAGS is not a control flow statement: it is a function, and both the TRUE and FALSE branches are evaluated. So even though you are saying to use media1[1:t] if t <= L[1], the FALSE branch is also evaluated, and that is what produces the error.
The other problem, once you fix that, is that you're re-defining the node current_window_x on every pass of the loop, which will also throw an error. I think the easiest way to deal with the variable window width is to hard-code the first few observations of new_media1 and then calculate the remaining ones in the loop, like this:
new_media1[1] <- media1[1]*weights1[1]
new_media1[2] <- inprod(media1[1:2], weights1[1:2])
new_media1[3] <- inprod(media1[1:3], weights1[1:3])
for (t in 4:N) {
  new_media1[t] <- inprod(media1[(t - L[1] + 1):t], weights1)
}
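If it helps, you can convince yourself the indexing is right before compiling the model by running the same recursion in plain R with dummy data (the values below are made up):
N <- 96
L <- c(4, 4, 4, 4, 4)
media1   <- rnorm(N)         # dummy stand-in for the real data
weights1 <- rep(0.25, L[1])  # dummy stand-in for the real weights
new_media1 <- numeric(N)
new_media1[1] <- media1[1] * weights1[1]
new_media1[2] <- sum(media1[1:2] * weights1[1:2])
new_media1[3] <- sum(media1[1:3] * weights1[1:3])
for (t in 4:N) {
  new_media1[t] <- sum(media1[(t - L[1] + 1):t] * weights1)
}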
I am trying to optimize a function using the genoud genetic optimizer in R, which is part of library(rgenoud). I have limited experience with setting up parallel processing in R.
genoud has a built-in cluster option that accepts the names of the machines to use. I want to set up parallel processing on a single multi-core machine. I understand the local machine's name is 'localhost', and that repeating the name will use more than one core on the machine.
This approach seems to be working fine for simple functions such as:
genout <- genoud(sin, 1, cluster=c('localhost','localhost','localhost'))
However, when I optimize more complex function structures, some of the worker machines or environments do not find all of the functions. Here is an example:
fun1 <- function(x) { sin(x) }
fun2 <- function(x) {
  x <- fun1(x)
  ret <- x + cos(x)
  return(ret)
}
genout <- genoud(fun2, 1, cluster=c('localhost','localhost','localhost'))
This gives the error:
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: could not find function "fun1"
One solution seems to be to embed fun1 inside fun2. However, this seems inefficient if fun2 is run a large number of times (in cases other than this toy example). Is embedding fun1 in fun2 really the only way to solve this problem?
Edit: Even when embedding, there are more problems when objects need to be passed to fun2 via genoud, see:
y <- 1
fun2 <- function(x, z) {
  fun1 <- function(x) { sin(x) }
  x <- fun1(x)
  ret <- x + cos(x)*z
  return(ret)
}
genout <- genoud(fun2, 1, cluster=c('localhost','localhost','localhost'), z=y)
Error in checkForRemoteErrors(val) :
3 nodes produced errors; first error: object 'y' not found
I think the problem is that the genoud function does not evaluate the additional arguments before sending them to the workers. In this case, the argument z is sent to the workers unevaluated, so the workers have to evaluate it themselves, and they fail because the variable y was never sent to them.
A work-around is to export the necessary variables to the workers using clusterExport, which requires you to explicitly create the cluster object for genoud to use:
library(rgenoud)
library(parallel)

cl <- makePSOCKcluster(3)

fun1 <- function(x) { sin(x) }
fun2 <- function(x, z) {
  x <- fun1(x)
  x + cos(x) * z
}

y <- 1
clusterExport(cl, c('fun1', 'y'))
genout <- genoud(fun2, 1, cluster=cl, z=y)
This also exports fun1, which is necessary since genoud doesn't know that fun2 depends on it.
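One bit of housekeeping: since the cluster is now created explicitly rather than by genoud, shut it down yourself when you're finished with it:
stopCluster(cl)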
I am running into this error when generating random points in a loop. spsample works fine if I generate points just once, but if I call it repeatedly I sooner or later end up with this error. Any ideas how to solve it properly? (Theoretically I could just skip the faulty iteration, but that is not nice coding, right?) The problem seems to happen only with the "random" option.
library(sp)  # provides meuse.riv, SpatialPolygons and spsample

data(meuse.riv)
meuse.sr <- SpatialPolygons(list(Polygons(list(Polygon(meuse.riv)), "x")))

# works fine if run just once
n <- 10
points <- spsample(meuse.sr, n, "random")

for (i in 1:5000) {
  print(i)
  points <- spsample(meuse.sr, n, "random")
}
I guess you should follow the advice of the error message:
for (i in 1:5000) {
  print(i)
  points <- spsample(meuse.sr, n, "random", iter=10)
}
This ran through all 5000 iterations without an error message. ?spsample says:
iter: (default = 4) number of times to try to place sample points in a polygon before giving up and returning NULL - this may occur when trying to hit a small and awkwardly shaped polygon in a large bounding box with a small number of points.
So giving it more chances can solve the problem.
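If you want to be fully defensive, you can combine a larger iter with an explicit NULL check, since the documentation says spsample may still give up and return NULL. A sketch (the wrapper name and retry cap are mine):
sample_until_success <- function(poly, n, tries = 20) {
  for (k in seq_len(tries)) {
    pts <- spsample(poly, n, "random", iter = 10)
    if (!is.null(pts)) return(pts)   # spsample succeeded
  }
  stop("spsample returned NULL in all ", tries, " attempts")
}
points <- sample_until_success(meuse.sr, n)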
Users,
I am looking for a way to parallelize my PLSR predictions in order to save processing time. I tried to use the foreach construct with %dopar% (cf. second part of the code below), but I was unable to allocate the predicted values, as well as the model performance parameter (RMSEP), to the output variable.
The code:
set.seed(10000)                      # generate some data...
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]

eD <- dist(x, method = "euclidean")  # distance matrix to find close samples
eDm <- as.matrix(eD)

kns <- matrix(NA, nrow(x), 10)       # empty matrix for the 10 closest samples
for (i in 1:nrow(eDm)) {             # identify the closest samples and store them in kns
  kns[i,] <- head(order(eDm[,i]), 11)[-1]
}
So far I consider the code "safe", but the next part challenges me, since I have never used the foreach construct before:
library(pls)
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

out <- foreach(j = 1:nrow(mat), .combine="rbind", .packages="pls") %dopar% {
  pls <- plsr(y ~ x, ncomp=5, validation="CV", subset=kns[j,])
  predict(pls, ncomp=5, newdata=x[j,,drop=F])
  RMSEP(pls, estimate="CV")$val[1,1,5]
}
stopCluster(cl)
As I understand it, the line starting with RMSEP(pls, ...) simply overwrites the result of the predict() line. Somehow I assumed the .combine option would take care of this?
Many thanks for your help!
Best, Chega
If you want to return two objects from the body of a foreach loop, you need to put them into an object such as a list:
out <- foreach(j = 1:nrow(mat), .packages="pls") %dopar% {
  pls <- plsr(y ~ x, ncomp=5, validation="CV", subset=kns[j,])
  list(p = predict(pls, ncomp=5, newdata=x[j,,drop=F]),
       r = RMSEP(pls, estimate="CV")$val[1,1,5])
}
Only the "final value" of the loop body is returned to the master and then processed by the .combine function.
Note that I removed the .combine argument, so the result will be a list of two-element lists. It's not clear to me that rbind is the appropriate function for processing the results.
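If you then want plain vectors of predictions and RMSEP values, you can unpack the list of lists afterwards, for example:
preds  <- sapply(out, function(el) el$p)   # predicted values, one per sample
rmseps <- sapply(out, function(el) el$r)   # CV RMSEP at 5 components, one per sample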
Since this question was originally answered, the pls package has been modified to allow the cross-validation to be run in parallel. The implementation is trivially easy: simply define either a persistent cluster, or the number of cores to use in a transient cluster, in pls.options.
If transient clusters are used, implementation literally requires only two lines of code:
library(parallel)
pls.options(parallel=NumberOfCoresToUse)
No changes to the output variables are needed.
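For the persistent-cluster variant mentioned above, pls.options() also accepts a cluster object (as I read the pls documentation), which you then create and stop yourself:
library(parallel)
library(pls)
cl <- makeCluster(4)           # persistent cluster, reused across model fits
pls.options(parallel = cl)
# ... run plsr(..., validation = "CV") calls here; their CV uses `cl` ...
stopCluster(cl)
pls.options(parallel = NULL)   # revert to serial cross-validation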
I haven't checked whether parallelizing at the calibration level, as in the question, would be more efficient. I suspect it would be, particularly when the number of calibration iterations is much larger than the number of cross-validation steps (especially when the number of CVs isn't a multiple of the number of cores used), but this approach is so straightforward that the extra coding effort may not be worth it.