R package submission error concerning set.seed()

I recently submitted a package to CRAN that passed all the automatic checks but failed the manual ones. One of the errors was the following:
Please do not set a seed to a specific number within a function.
Please do not modifiy the .GlobalEnv. This is not allowed by the CRAN policies.
I believe these comments refer to the following lines of code:
if(simul == TRUE){
set.seed(42)
}
w <- matrix(data = rbinom(n = p, size = 1, prob = 0.5), ncol = 1)
beta <- w*beta-(1-w)*beta
s <- round((1-sparsity)*p)
toReplace <- sample(p, size = s)
beta <- replace(beta, list = toReplace, values = 0)
# Generate the random p-columned matrix of indicator series.
X <- matrix(data = rnorm ((n_l*m) * p, mean = mean_X, sd = sd_X), ncol = p, nrow = n_l*m)
if(simul == TRUE){
rm(.Random.seed, envir = globalenv())
}
Essentially, the function has a simulation option, "simul": when it is set to "TRUE", the matrix "X" and the coefficient vector "beta" remain fixed. I remove the seed at the end of this segment (final lines), because the rest of the code contains variables that should change at each iteration of the simulation. However, as noted in the feedback from CRAN, this is not allowed. What is an alternative way to go about this? I cannot hard-code a fixed vector "beta" or matrix "X" for when "simul" is "TRUE", since their dimensions are inputs to the function and thus vary with the preferences of the investigator.

If you really, really want to set the seed inside a function, which I believe neither you nor anyone else should do, save the current seed, do whatever you want, and reset it to the saved value before exiting the function.
old_seed <- .Random.seed
rnorm(1)
#[1] -1.173346
set.seed(42)
rbinom(1, size = 1, prob = 0.5)
#[1] 0
.Random.seed <- old_seed
rnorm(1)
#[1] -1.173346
In a function, it could look something like the following (without the message calls). Note that the function prints nothing, never calls any pseudo-RNG, and always returns TRUE invisibly. The point is to save the seed's current value and restore it in on.exit().
f <- function(simul = FALSE){
if(simul){
message("simul is TRUE")
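old_seed <- get(".Random.seed", envir = globalenv())
# a plain `.Random.seed <- old_seed` here would only create a local variable
on.exit(assign(".Random.seed", old_seed, envir = globalenv()))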
# rest of code
} else message("simul is FALSE")
invisible(TRUE)
}
f()
s <- .Random.seed
f(TRUE)
identical(s, .Random.seed)
#[1] TRUE
rm(s)
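As a rough sketch of the same save-and-restore idea for code that does draw random numbers, the helper below evaluates an expression under a temporary seed and then puts the global RNG state back. The name with_temp_seed is hypothetical (it is not part of base R), and the helper still touches .Random.seed in the global environment, so it may not satisfy a CRAN reviewer any better than the original code.

```r
# Hypothetical helper: evaluate `expr` under a temporary seed, then restore
# whatever RNG state the user had before the call.
with_temp_seed <- function(seed, expr) {
  genv <- globalenv()
  had_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
  if (had_seed) {
    old_seed <- get(".Random.seed", envir = genv, inherits = FALSE)
  }
  on.exit({
    if (had_seed) {
      # put the saved state back where R looks for it
      assign(".Random.seed", old_seed, envir = genv)
    } else {
      # there was no seed before the call, so remove the one we created
      rm(".Random.seed", envir = genv)
    }
  }, add = TRUE)
  set.seed(seed)
  force(expr)  # lazy evaluation: `expr` is only evaluated here, after set.seed()
}

set.seed(1)
before <- .Random.seed
a <- with_temp_seed(42, rnorm(3))
b <- with_temp_seed(42, rnorm(3))
identical(a, b)                    # same draws under the same temporary seed
identical(before, .Random.seed)    # the caller's RNG state is untouched
```

Both comparisons should return TRUE: the draws are reproducible inside the helper, while the caller's random number stream continues as if the helper had never been called.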

A similar question has been asked on the Bioconductor developer (bioc-devel) mailing list. The suggestion there was to use withr::with_seed. Your code could then become:
library(withr)
if(simul == TRUE){
w <- with_seed(42, matrix(data = rbinom(n = p, size = 1, prob = 0.5), ncol = 1))
} else {
w <- matrix(data = rbinom(n = p, size = 1, prob = 0.5), ncol = 1)
}
beta <- w*beta-(1-w)*beta
s <- round((1-sparsity)*p)
toReplace <- sample(p, size = s)
beta <- replace(beta, list = toReplace, values = 0)
# Generate the random p-columned matrix of indicator series.
X <- matrix(data = rnorm ((n_l*m) * p, mean = mean_X, sd = sd_X), ncol = p, nrow = n_l*m)
Of course that raises the question of how withr got on CRAN, given that it appears to do the same thing that you're being told not to do - the difference may be that your version may overwrite an existing seed, whereas that code checks whether a seed already exists.

When you fix the seed, a user who runs this code with the same parameters will obtain the same results every time.
Assuming that this chunk of code sits inside a larger chunk related only to the simulation, just get rid of the set.seed() call and try something like this:
if(simul == TRUE){
w <- matrix(data = rbinom(n = p, size = 1, prob = 0.5), ncol = 1)
beta <- w*beta-(1-w)*beta
s <- round((1-sparsity)*p)
toReplace <- sample(p, size = s)
beta <- replace(beta, list = toReplace, values = 0)
# Generate the random p-columned matrix of indicator series.
X <- matrix(data = rnorm ((n_l*m) * p, mean = mean_X, sd = sd_X), ncol = p, nrow = n_l*m)
}
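To illustrate what dropping set.seed() from the function means for the caller, here is a minimal sketch. The function make_design() is a hypothetical, simplified stand-in for the package function (not the actual package code): it draws random values but never touches the seed, so reproducibility becomes the caller's decision.

```r
# Hypothetical stand-in for the package function: it uses the RNG
# but leaves seeding entirely to the caller.
make_design <- function(n, p, sparsity = 0.5) {
  beta <- rnorm(p)
  n_zero <- round((1 - sparsity) * p)       # number of coefficients set to zero
  beta[sample(p, size = n_zero)] <- 0
  X <- matrix(rnorm(n * p), nrow = n, ncol = p)
  list(beta = beta, X = X)
}

# The user who wants fixed "beta" and "X" sets the seed before calling.
set.seed(42)
first <- make_design(n = 10, p = 5)
set.seed(42)
second <- make_design(n = 10, p = 5)
identical(first, second)
#[1] TRUE
```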

Related

R - fast interpolation between CDF quantiles

Say we have a data.frame where the columns represent the quantiles for a given set of probabilities. Each row represents a different subject and the quantiles vary by subject. The goal is to take n_draws for each subject.
n <- 1e5
alphas <- c(.05, .25, .5, .75, .95)
n_draws <- 100
dt <- data.frame(quantile_05 = runif(n),
quantile_25 = runif(n, min = 10, max = 20),
quantile_5 = runif(n, min = 30, max = 40),
quantile_75 = runif(n, min = 50, max = 60),
quantile_95 = runif(n, min = 70, max = 80))
R has stats::approx. The issue is that it can only be applied to 1 row at once.
draws <- apply(X = dt, MARGIN = 1, function(q){
stats::approx(x=alphas, y=q, yleft = 0, rule = 2, method="linear", xout = seq(.0001, .99999, length.out = n_draws))$y
})
Naturally, one way to speed this up is parallelization:
library(parallel)
# registerDoParallel() belongs to the doParallel package and is not needed by parApply()
cl <- makeCluster(8)
clusterExport(cl, c('alphas', 'n_draws'))
draws <- parApply(cl=cl, X = dt, MARGIN = 1, function(q){
stats::approx(x=alphas, y=q, yleft = 0, rule = 2, method="linear", xout = seq(.0001, .99999, length.out = n_draws))$y
})
stopCluster(cl) # release the worker processes
This parallel code is several times faster on my machine. I am curious if anyone has suggestions for further speed ups.
To be a little more clear, this is just a simplification of what I'm using in real life. I have more than 5 quantiles. I want to use this concept to estimate not only other quantiles, but also the mean. Further, I'd like to use the draws to model other quantities like max(y - 10, 0) (or y - any arbitrary value).
Your proposed method is rather inefficient for the desired goal. You end up storing 100 points for each desired ECDF, and you will still need to write a function to extract a desired draw. Instead, I suggest you consider using approxfun. It returns a more compact representation: one function per subject, each with an associated environment that holds the knots for later calculation. The calculation is done by a C call that is accessed through the internal helper function .approxfun.
Demonstrating the internals (up to a point) that I'm suggesting:
out <- approxfun(y=dt[1,], x=alphas,yleft = 0, rule = 2, method="linear")
# So out is now a single instance using the knots in the first row
out
#function (v)
#.approxfun(x, y, v, method, yleft, yright, f, na.rm)
#<bytecode: 0x558366535968>
#<environment: 0x5583690a04f8>
ls(environment(out))
#[1] "f" "method" "na.rm" "x" "y" "yleft" "yright"
environment(out)$x
#[1] 0.05 0.25 0.50 0.75 0.95
environment(out)$y
#[1] 0.4038727 17.7069735 33.4438595 57.2753257 77.2024894
If you wanted the estimated 55th percentile for the first case, you could get it with:
out(55/100)
#[1] 38.21015
And now that I've suggested a way to speed up your creation of this list of functions, I'm not even sure it's worth it. I think you could just leave that dt data frame in place and call approxfun when needed. But that's your call.
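A minimal sketch of the list-of-functions idea, assuming a small n for illustration. The names q_mat, quantile_funs, approx_means and approx_excess are made up for this example; the evaluation grid is the one from the question.

```r
set.seed(1)
n <- 5  # small n, for illustration only
alphas <- c(.05, .25, .5, .75, .95)
dt <- data.frame(quantile_05 = runif(n),
                 quantile_25 = runif(n, min = 10, max = 20),
                 quantile_5  = runif(n, min = 30, max = 40),
                 quantile_75 = runif(n, min = 50, max = 60),
                 quantile_95 = runif(n, min = 70, max = 80))

# Row access is cheaper on a matrix than on a data frame.
q_mat <- as.matrix(dt)

# One interpolating quantile function per subject (row).
quantile_funs <- lapply(seq_len(nrow(q_mat)), function(i) {
  approxfun(x = alphas, y = q_mat[i, ], yleft = 0, rule = 2, method = "linear")
})

# Evaluating at a knot returns the stored quantile itself.
quantile_funs[[1]](0.5)

# Derived quantities on demand, using the probability grid from the question.
probs <- seq(.0001, .99999, length.out = 100)
approx_means  <- vapply(quantile_funs, function(f) mean(f(probs)), numeric(1))
approx_excess <- vapply(quantile_funs,
                        function(f) mean(pmax(f(probs) - 10, 0)), numeric(1))
```

Only the knots are stored per subject; the 100 interpolated values are computed when a summary is requested and then discarded.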
Note: This is essentially the method used by the ecdf function:
ecdf
function (x)
{
x <- sort(x)
n <- length(x)
if (n < 1)
stop("'x' must have 1 or more non-missing values")
vals <- unique(x)
rval <- approxfun(vals, cumsum(tabulate(match(x, vals)))/n,
method = "constant", yleft = 0, yright = 1, f = 0, ties = "ordered")
class(rval) <- c("ecdf", "stepfun", class(rval))
assign("nobs", n, envir = environment(rval))
attr(rval, "call") <- sys.call()
rval
}
<bytecode: 0x558364a0f360>
<environment: namespace:stats>
And it's possible that you might want to use the ecdf function because it has some class-associated functions.

How do I iterate over several lists and matrices to call a function using parallel processing in R?

I have been trying to use an R function called ipsi, which takes the arguments (a, y, id, time, x.trt, x.out, delta.seq, nsplits). Originally, the components of the arguments were in one data frame (except for delta.seq and nsplits, which are coded later), but my understanding is that I needed to put them in separate lists and, in the case of x.trt and x.out, matrices. This function is very easy to run on one of each argument, but since I multiply imputed the data frame 30 times before splitting it up into the different elements to be passed as ipsi arguments, I now want to iterate over the set of elements 30 times, as if there were 30 data frames. Additionally, I want to parallelize to make the most of my computing power.
I have just expanded the npcausal example:
n <- 500
T <- 4
time <- rep(1:T, n)
time <- list(time,time,time,time,time,time,time,time,time,time,time,time,time,time,time,
time,time,time,time,time,time,time,time,time,time,time,time,time,time,time)
id <- rep(1:n, rep(T, n))
id <- list(id,id,id,id,id,id,id,id,id,id,id,id,id,id,id,
id,id,id,id,id,id,id,id,id,id,id,id,id,id,id)
x.trt <- matrix(rnorm(n * T * 5), nrow = n * T)
x.trt <- list(x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,
x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt,x.trt)
x.out <- matrix(rnorm(n * T * 5), nrow = n * T)
x.out <- list(x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,
x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out,x.out)
a <- rbinom(n * T, 1, .5)
a <- list(a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a)
y <- rnorm(mean=1,n)
y <- list(y,y,y,y,y,y,y,y,y,y,y,y,y,y,y,
y,y,y,y,y,y,y,y,y,y,y,y,y,y,y)
d.seq <- seq(0.1, 5, length.out = 10)
d.seq <- list(d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,
d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq,d.seq)
set.seed(500, kind = "L'Ecuyer-CMRG")
numcores <- future::availableCores()
cl <- parallel::makeCluster(numcores)
parallel::clusterEvalQ(cl, library(dplyr))
parallel::clusterEvalQ(cl, library(npcausal))
parallel::clusterExport(cl, "d.seq", envir = environment())
parallel::clusterEvalQ(cl, d.seq <- d.seq)
new_element <- parallel::parLapply(cl = cl, for(i in 1:30){
npcausal::ipsi(a = a[[i]],
y = y[[i]],
id = id[[i]],
time = time[[i]],
x.out = x.out[[i]],
x.trt = x.trt[[i]],
delta.seq = d.seq[[i]],
nsplits = 10)
})
This actually runs, but at the end of the process it gives me an error saying that the FUN was missing. I knew that already, but I have no FUN to call besides ipsi. Thanks for any help you can provide.
My suggestion is to first figure out how to do it with a regular base-R *apply function without worrying about parallelization. I suspect you can use mapply() for this, so something like (unconfirmed):
res <- mapply(
a, y, id, time, x.out, x.trt, d.seq,
FUN = function(a_i, y_i, id_i, time_i, x.out_i, x.trt_i, d.seq_i) {
npcausal::ipsi(a = a_i, y = y_i, id = id_i, time = time_i,
x.out = x.out_i, x.trt = x.trt_i, delta.seq = d.seq_i,
nsplits = 10)
}
)
Once you have figured that part out, you can start thinking about parallelization.
(Disclaimer: I'm the author) If you get an mapply() solution to work, then the simplest would be to replace that as-is with future_mapply() of the future.apply package. That will parallelize on your local machine if you set plan(multisession).
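The mapply() pattern can be tried out on a small scale before involving npcausal or any cluster. In the sketch below, fit_one() is a hypothetical stand-in for npcausal::ipsi() that only summarises its inputs, and the lists hold three small "imputed" data sets instead of thirty. It also shows why the parLapply() call in the question fails: parLapply(cl, X, fun) expects a list X and a function fun, whereas a for loop passed as X is simply evaluated and yields NULL.

```r
# Hypothetical stand-in for npcausal::ipsi(): summarises one data set.
fit_one <- function(a, y, nsplits) {
  list(mean_a = mean(a), mean_y = mean(y), nsplits = nsplits)
}

set.seed(500)
# Three small lists of per-imputation inputs (illustrative sizes).
a <- replicate(3, rbinom(20, 1, 0.5), simplify = FALSE)
y <- replicate(3, rnorm(20, mean = 1), simplify = FALSE)

# mapply() walks over the lists in parallel positions: a[[i]] with y[[i]].
res <- mapply(
  FUN = function(a_i, y_i) fit_one(a = a_i, y = y_i, nsplits = 10),
  a, y,
  SIMPLIFY = FALSE  # keep one list element per data set
)

length(res)       # one result per imputed data set
res[[1]]$nsplits  # constant arguments are fixed inside the wrapper
```

Because the iteration is expressed as a function applied to the i-th elements, the same wrapper can later be handed to a parallel map function unchanged.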

Time varying parameter-matrix in deSolve R

I have been struggling with this for a long time. I have a logistic growth function where the growth parameter
r is a matrix. The model is constructed so that it outputs two state variables, N1 and N2.
I would like to be able to change the r parameter over time. When time < 50, I would like
r = r1, where
r1=matrix(c(
2,3),
nrow=1, ncol=2)
When time >= 50, I would like r = r2, where
r2=matrix(c(
1,2),
nrow=1, ncol=2)
Here is my function. Any help is highly appreciated.
rm(list = ls())
library(deSolve)
model <- function(time, y, params) {
with(as.list(c(y,params)),{
N = y[paste("N",1:2, sep = "")]
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
r=matrix(c(
4,5),
nrow=1, ncol=2)
K=100
params <- list(r = r, K = K) # name the elements so that with() can find r and K
y<- c(N1=0.1, N2=0.2)
times <- seq(0,100,1)
out <- ode(y, times, model, params)
plot(out)
I would like ideally something like this but it does not work
model <- function(time, y, params) {
with(as.list(c(y,params)),{
N = y[paste("N",1:2, sep = "")]
r = ifelse(times < 10, matrix(c(1,3),nrow=1, ncol=2),
ifelse(times > 10, matrix(c(1,4),nrow=1, ncol=2), matrix(c(1,2),nrow=1, ncol=2)))
print(r)
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
Thank you for your time.
Here is a generic approach that uses an extended version of the approx function. Note also some further simplifications of the model function and the additional plot of the parameter values.
Edit: changed according to the suggestion of Lewis Carter to make the parameter change at t=3, so that the effect can be seen.
library(simecol) # contains approxTime, a vector version of approx
model <- function(time, N, params) {
r <- approxTime(params$signal, time, rule = 2, f=0, method="constant")[-1]
K <- params$K
dN <- r*N*(1-N/K)
return(list(c(dN), r))
}
signal <- matrix(
# time, r[1, 2],
c( 0, 2, 3,
3, 1, 2,
100, 1, 2), ncol=3, byrow=TRUE
)
## test of the interpolation
approxTime(signal, c(1, 2.9, 3, 100), rule = 2, f=0, method="constant")
params <- list(signal = signal, K = 100)
y <- c(N1=0.1, N2=0.2)
times <- seq(0, 10, 0.1)
out <- ode(y, times, model, params)
plot(out)
For a small number of state variables, as in the example, separate signals with approxfun from package stats will look less generic but may be slightly faster.
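The separate-signal variant could look roughly like this, using the same switch at t = 3 as in the signal matrix. The names r1_fun and r2_fun are illustrative; inside the model function one would then write r <- c(r1_fun(time), r2_fun(time)).

```r
# One step function per growth rate, built once outside the model function.
# method = "constant" with f = 0 holds the left-hand value until the next knot.
r1_fun <- approxfun(x = c(0, 3, 100), y = c(2, 1, 1),
                    method = "constant", f = 0, rule = 2)
r2_fun <- approxfun(x = c(0, 3, 100), y = c(3, 2, 2),
                    method = "constant", f = 0, rule = 2)

c(r1_fun(2.9), r2_fun(2.9))  # values before the switch
#[1] 2 3
c(r1_fun(3), r2_fun(3))      # values from t = 3 onwards
#[1] 1 2
```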
As a further improvement, one may consider replacing the "hard" transitions with a smoother one. This can then be formulated directly as a function, without the need for approx, approxfun or approxTime.
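One possible way to write such a smooth transition is a logistic weight between the two parameter vectors. This is only a sketch: the function name r_smooth, the switch time t_switch and the steepness value are assumptions chosen for illustration, and the model function is called once by hand here rather than through ode().

```r
# Hypothetical smooth switch: r moves from r1 to r2 around t_switch.
r_smooth <- function(time, r1 = c(2, 3), r2 = c(1, 2),
                     t_switch = 3, steepness = 10) {
  w <- 1 / (1 + exp(-steepness * (time - t_switch)))  # ~0 before, ~1 after
  (1 - w) * r1 + w * r2
}

# Same logistic growth model as above, with the smooth r plugged in.
model_smooth <- function(time, N, params) {
  r <- r_smooth(time)
  dN <- r * N * (1 - N / params$K)
  list(c(dN), r)
}

r_smooth(3)  # halfway between r1 and r2 at the switch time
#[1] 1.5 2.5
model_smooth(0, c(N1 = 0.1, N2 = 0.2), list(K = 100))
```

A larger steepness brings the curve closer to the hard step, while a smaller one spreads the change over a longer time window.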
Edit 2:
Package simecol imports deSolve, and we need only a small function from it. So instead of loading simecol it is also possible to include the approxTime function explicitly in the code. The conversion from data frame to matrix improves performance, but a matrix is preferred anyway in such cases.
approxTime <- function(x, xout, ...) {
if (is.data.frame(x)) {x <- as.matrix(x); wasdf <- TRUE} else wasdf <- FALSE
if (!is.matrix(x)) stop("x must be a matrix or data frame")
m <- ncol(x)
y <- matrix(0, nrow=length(xout), ncol=m)
y[,1] <- xout
for (i in 2:m) {
y[,i] <- as.vector(approx(x[,1], x[,i], xout, ...)$y)
}
if (wasdf) y <- as.data.frame(y)
names(y) <- dimnames(x)[[2]]
y
}
If you want to pass a matrix parameter, pass a list of parameters; you can then modify it inside the model once your time limit is exceeded. (In the example below you do not even have to pass the r matrix to the model function.)
library(deSolve)
model <- function(time, y, params) {
with(as.list(c(y,params)),{
if(time < 3) r = matrix(c(2,3), nrow = 1, ncol = 2)
else r = matrix(c(1,3), nrow = 1, ncol = 2)
N = y[paste("N",1:2, sep = "")]
dN <- r*N*(1-N/K)
return(list(c(dN)))
})
}
y <- c(N1=0.1, N2=0.2)
params <- list(r = matrix(c(0,0), nrow = 1, ncol = 2), K=100)
times <- seq(0,10,0.1)
out <- ode(y, times, model, params)
plot(out)
You can see examples of this, for instance, with delay differential equations; see ?dede.

lme4 allFit() giving confusing results when wrapped in a function

I am using allFit() in lme4 to automatically scan through possible optimizers, since the default optimizer usually doesn't converge in this situation. My code works fine when I run it line-by-line, but when I run it wrapped in a simple function, it gives different results.
I've looked at the output of the allFit call and it seems that when it's NOT inside the function, it returns a list of lmerModLmerTest objects as desired.
However, inside the function, it returns a list with the values simpleError, error, and condition. Why is it doing this?
I'm using RStudio, R 3.6, lme4 1.1-21, lmerTest 3.1-0.
UPDATE: The problem is that the update() method used by allFit cannot find the 'tt' data frame when re-fitting the models. I have put breakpoints into the code and it seems that the 'test' data exists in the function environment, however, so I don't understand why it can't find it...
UPDATE 2: It appears that if I change the assignment of the test data to <<-, it works. This is dangerous, though, by breaking functional programming, and I think it may fail when I try to parallelize. I am testing further... still open to suggestions!
Here is the code that works, not inside the function:
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0,.3)){
sig_b2 <- -1*iccs / (iccs - 1)
n <- nsub * nclust * narm
y <- rep_len(NA, n)
arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
clustid <- rep(1:(nclust*narm), each = nsub)
clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
clustRE <- rep(clustRElist, each = nsub)
sig_b2 <- rep(sig_b2, each = nclust*nsub)
error <- rnorm(n, mean = 0, sd = 1)
beta <- rep(betas, each = nclust*nsub)
linpred <- beta + clustRE + error
output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
return(output)
}
set.seed(2)
test_1 <- multi_arm_var_sim()
model_flex_1 <- lmer(linpred ~ arm + (1 + arm | clustid),
data = test_1)
diff_optims_1 <- allFit(model_flex_1, verbose = TRUE)
print(class(diff_optims_1[[1]]))
is.OK_1 <- sapply(diff_optims_1, is, "lmerMod")
print(is.OK_1)
And here is the code that doesn't work, same setup, wrapped in a function.
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0,.3)){
sig_b2 <- -1*iccs / (iccs - 1)
n <- nsub * nclust * narm
y <- rep_len(NA, n)
arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
clustid <- rep(1:(nclust*narm), each = nsub)
clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
clustRE <- rep(clustRElist, each = nsub)
sig_b2 <- rep(sig_b2, each = nclust*nsub)
error <- rnorm(n, mean = 0, sd = 1)
beta <- rep(betas, each = nclust*nsub)
linpred <- beta + clustRE + error
output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
return(output)
}
get_pval <- function(){
tt <- multi_arm_var_sim()
model_flex <- lme4::lmer(linpred ~ arm + (1 + arm | clustid),
data = tt)
diff_optims <- lme4::allFit(model_flex, data = tt, verbose = TRUE)
print(class(diff_optims[[1]]))
is.OK <- sapply(diff_optims, is, "merMod")
print(is.OK)
}
set.seed(2)
get_pval()
Thanks!!

R Optimization given objective function

obj1<-function(monthly.savings,
success,
start.capital,
target.savings,
monthly.mean.return,
monthly.ret.std.dev,
monthly.inflation,
monthly.inf.std.dev,
n.obs,
n.sim=1000){
req = matrix(start.capital, n.obs+1, n.sim) #matrix for storing target weight
monthly.invest.returns = matrix(0, n.obs, n.sim)
monthly.inflation.returns = matrix(0, n.obs, n.sim)
monthly.invest.returns[] = rnorm(n.obs * n.sim, mean = monthly.mean.return, sd = monthly.ret.std.dev)
monthly.inflation.returns[] = rnorm(n.obs * n.sim, mean = monthly.inflation, sd = monthly.inf.std.dev)
#for loop to be
for (a in 1:n.obs){
req[a + 1, ] = req[a, ] * (1 + monthly.invest.returns[a,] - monthly.inflation.returns[a,]) + monthly.savings
}
ending.values=req[nrow(req),]
suc<-sum(ending.values>target.savings)/n.sim
value<-success-suc
return(abs(value))
}
I have the above objective function that I want to minimize. It solves for the monthly savings required to reach a given probability of success, given the following input assumptions:
success<-0.9
start.capital<-1000000
target.savings<-1749665
monthly.savings=10000
monthly.mean.return<-(5/100)/12
monthly.ret.std.dev<-(3/100)/sqrt(12)
monthly.inflation<-(5/100)/12
monthly.inf.std.dev<-(1.5/100)/sqrt(12)
monthly.withdrawals<-10000
n.obs<-10*12 #years * 12 months in a year
n.sim=1000
I used the following notation:
optimize(f=obj1,
success=success,
start.capital=start.capital,
target.savings=target.savings,
monthly.mean.return=monthly.mean.return,
monthly.ret.std.dev=monthly.ret.std.dev,
monthly.inflation=monthly.inflation,
monthly.inf.std.dev=monthly.inf.std.dev,
n.obs = n.obs,
n.sim = n.sim,
lower = 0,
upper = 10000,
tol = 0.000000001,maximum=F)
I get 7875.03
Since I am sampling from a normal distribution, the output will be different each time, but the results should agree to within a few percentage points. The problem I am having is that I can't specify an upper limit arbitrarily. The upper limit in the above example (10000) was cherry-picked after numerous trials. If, say, I put in an upper limit of 100000 (unreasonable, I know), it will return that number as opposed to finding the global minimum savings. Any ideas where I am structuring my objective function incorrectly?
thanks,
The fact that your function does not always return the same output for a given input
is likely to pose a few problems (it will create a lot of spurious local minima):
you can avoid them by setting the seed of the random number generator
inside the function (e.g., set.seed(1)),
or by storing the random numbers and reusing them each time,
or by using a low-discrepancy sequence (e.g., randtoolbox::sobol).
Since it is a function of one variable, you can simply plot it to see what happens:
it has a plateau after 10,000 -- optimization algorithms cannot distinguish
between a plateau and a local optimum.
f <- function(x) {
set.seed(1)
obj1(x,
success = success,
start.capital = start.capital,
target.savings = target.savings,
monthly.mean.return = monthly.mean.return,
monthly.ret.std.dev = monthly.ret.std.dev,
monthly.inflation = monthly.inflation,
monthly.inf.std.dev = monthly.inf.std.dev,
n.obs = n.obs,
n.sim = n.sim
)
}
g <- Vectorize(f)
curve(g(x), xlim=c(0, 20000))
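The second option listed above, storing the random numbers and reusing them, can be sketched as follows. The inputs are the ones from the question; net.returns and success_rate are names made up for this illustration, and the set.seed(1) call at the top level only fixes this one example run.

```r
# Inputs taken from the question.
success <- 0.9
start.capital <- 1000000
target.savings <- 1749665
monthly.mean.return <- (5/100)/12
monthly.ret.std.dev <- (3/100)/sqrt(12)
monthly.inflation <- (5/100)/12
monthly.inf.std.dev <- (1.5/100)/sqrt(12)
n.obs <- 10*12
n.sim <- 1000

# Draw the random shocks once, outside the objective function.
set.seed(1)
net.returns <- matrix(
  rnorm(n.obs * n.sim, mean = monthly.mean.return, sd = monthly.ret.std.dev) -
    rnorm(n.obs * n.sim, mean = monthly.inflation, sd = monthly.inf.std.dev),
  nrow = n.obs, ncol = n.sim
)

# Deterministic given the stored shocks: same input, same output.
success_rate <- function(monthly.savings) {
  req <- rep(start.capital, n.sim)
  for (a in seq_len(n.obs)) {
    req <- req * (1 + net.returns[a, ]) + monthly.savings
  }
  mean(req > target.savings)
}

identical(success_rate(5000), success_rate(5000))
root <- uniroot(function(s) success - success_rate(s), c(0, 1e6))$root
root
```

Because every evaluation reuses the same simulated paths, the noise that creates spurious local minima disappears without calling set.seed() inside the objective function.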
Your initial problem is actually not a minimization problem,
but a root finding problem, which is much easier.
obj2 <- function(monthly.savings) {
set.seed(1)
req = matrix(start.capital, n.obs+1, n.sim)
monthly.invest.returns <- matrix(0, n.obs, n.sim)
monthly.inflation.returns <- matrix(0, n.obs, n.sim)
monthly.invest.returns[] <- rnorm(n.obs * n.sim, mean = monthly.mean.return, sd = monthly.ret.std.dev)
monthly.inflation.returns[] <- rnorm(n.obs * n.sim, mean = monthly.inflation, sd = monthly.inf.std.dev)
for (a in 1:n.obs)
req[a + 1, ] <- req[a, ] * (1 + monthly.invest.returns[a,] - monthly.inflation.returns[a,]) + monthly.savings
ending.values <- req[nrow(req),]
suc <- sum(ending.values>target.savings)/n.sim
success - suc
}
uniroot(obj2, c(0, 1e6))$root
# [1] 7891.187
