Below is an MWE of my problem: I have programmed a progress bar for a function that uses the bootstrap (via the boot function from the boot package).
This works fine as long as I don't use parallel processing (res_1core below). If I want to use parallel processing by setting parallel = "multicore" and ncpus = 2, the progress bar isn't displayed properly (res_2core below).
library(boot)
rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
env <- environment()
counter <- 0
progbar <- txtProgressBar(min = 0, max = R, style = 3)
bootfun <- function(formula, data, indices) {
d <- data[indices,]
fit <- lm(formula, data = d)
curVal <- get("counter", envir = env)
assign("counter", curVal + 1, envir = env)
setTxtProgressBar(get("progbar", envir = env), curVal + 1)
return(summary(fit)$r.squared)
}
res <- boot(data = data, statistic = bootfun, R = R, formula = formula, parallel = parallel, ncpus = ncpus)
return(res)
}
res_1core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000)
res_2core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000, parallel = "multicore", ncpus = 2)
I have read that this is related to the fact that the boot function calls on lapply for single core processing and mclapply for multicore processing. Does anyone know of an easy workaround to deal with this? I mean, I would like to display the progress taking into account all of the parallel processes.
Update
Thanks to the input of Karolis Koncevičius, I have found a workaround (just use the updated rsq function below):
rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
bootfun <- function(formula, data, indices) {
d <- data[indices,]
fit <- lm(formula, data = d)
return(summary(fit)$r.squared)
}
env <- environment()
counter <- 0
progbar <- txtProgressBar(min = 0, max = R, style = 3)
flush.console()
intfun <- function(formula, data, indices) {
curVal <- get("counter", envir = env) + ncpus
assign("counter", curVal, envir = env)
setTxtProgressBar(get("progbar", envir = env), curVal)
bootfun(formula, data, indices)
}
res <- boot(data = data, statistic = intfun, R = R, formula = formula, parallel = parallel, ncpus = ncpus)
return(res)
}
Unfortunately, this only works for multicore processing when I run R from the terminal. Any ideas how to patch this so it also displays properly in the R console or RStudio?
Not exactly what you ordered, but might be helpful.
A simple statistics function to boot:
library(boot)
bootfun <- function(formula, data, indices) {
d <- data[indices,]
fit <- lm(formula, data=d)
summary(fit)$r.squared
}
Higher order function to display progress:
progressReporter <- function(total, nBars=100, f, ...) {
count <- 1
step <- ceiling(total/nBars)
cat(paste(rep("|", nBars), collapse=""), "\r")
flush.console()
function(...) {
if (count %% step==0) {
cat(".")
}
count <<- count + 1
f(...)
}
}
Now this function is cheating - it displays progress every "step" of iterations. If you have 1000 iterations, use two cores and print every 10th iteration - it will do the job. The cores don't share state, but they each will run the counter up to 500, and the function will respond to both counters.
On the other hand if you do 1000 iterations, run 10 cores and report every 200 - the function will stay silent, as all the cores will count to 100 each. None will reach 200 - no progress bar. Hope you get the idea. I think it should be ok in most of the cases.
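The arithmetic behind that caveat can be checked with a small sketch. It assumes the iterations are split evenly across the workers and that each worker keeps its own private counter; `ticks_printed` is a made-up helper for illustration, not part of boot.

```r
# Hypothetical helper: total number of progress ticks printed when `total`
# iterations are split evenly over `ncpus` workers, each with a private
# counter that prints whenever count %% step == 0.
ticks_printed <- function(total, ncpus, step) {
  per_worker <- total %/% ncpus            # iterations each worker sees
  ticks_per_worker <- per_worker %/% step  # multiples of `step` it reaches
  ticks_per_worker * ncpus
}

ticks_printed(total = 1000, ncpus = 2,  step = 10)   # 100 ticks
ticks_printed(total = 1000, ncpus = 10, step = 200)  # 0 ticks
```

So the bar fills completely in the first case and stays empty in the second, which is the limitation described above.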
Try it out:
res_1core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun))
res_2core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun), parallel="multicore", ncpus=2)
I'm tuning parameters in caret with a custom summaryFunction.
I originally thought that with K-fold cross-validation and N input data points, performance would be measured on N/K data points per fold.
However, that does not seem to be the case: when I inspect data$pred with browser() (data being what is handed to the summary function), it contains only 10 values.
Since the input (df) has over 500 data points, this is far fewer than I expected.
Why does it contain only 10 values? Is there any way to increase this, so that performance is tested on more data points?
Any help is appreciated. Thank you.
sigma.list <- seq(1, 5, 1)
c.list <- seq(1, 10, 1)
met <- "FValue"
#define evaluation function
eval <- function(data, lev = NULL, model = NULL){
mat <- table(data$pred, data$obs)
pre <- mat[1,1]/sum(mat[1,]) #precision
rec <- mat[1,1]/sum(mat[,1]) #recall
res <- c("Precision"=pre, "Recall"=rec, "FValue"=2*pre*rec/(pre+rec))
browser()
res
}
#define train control
tc <- trainControl(method = "cv",
number = 5,
summaryFunction = eval,
classProbs = TRUE
)
#tune with caret
svm.tune <- train(Flag~.,
data = df,
method = "svmRadial",
tuneGrid = expand.grid(C=c.list, sigma=sigma.list),
trControl = tc,
metric = met
)
After tracking this down, it appears this is normal caret behavior.
I think that caret is essentially verifying that your summaryFunction is working properly by passing fake data (of length 10) to it. The function inside caret that is doing this is evalSummaryFunction.
I'm no expert with RStudio's debugger, but this code in train.default:
testSummary <- evalSummaryFunction(y, wts = weights,
ctrl = trControl, lev = classLevels, metric = metric,
method = method)
perfNames <- names(testSummary)
calls evalSummaryFunction which looks like:
function (y, wts = NULL, perf = NULL, ctrl, lev, metric, method)
{
n <- if (class(y)[1] == "Surv")
nrow(y)
else length(y)
if (class(y)[1] != "Surv") {
if (is.factor(y)) {
values <- rep_len(levels(y), min(10, n))
pred_samp <- factor(sample(values), levels = lev)
obs_samp <- factor(sample(values), levels = lev)
}
else {
pred_samp <- sample(y, min(10, n))
obs_samp <- sample(y, min(10, n))
}
}
else {
pred_samp <- y[sample(1:n, min(10, n)), "time"]
obs_samp <- y[sample(1:n, min(10, n)), ]
}
testOutput <- data.frame(pred = pred_samp, obs = obs_samp)
if (!is.null(perf)) {
if (is.vector(perf))
stop("`perf` should be a data frame", call. = FALSE)
perf <- perf[sample(1:nrow(perf), nrow(testOutput)),
, drop = FALSE]
testOutput <- cbind(testOutput, perf)
}
if (ctrl$classProbs) {
for (i in seq(along = lev)) testOutput[, lev[i]] <- runif(nrow(testOutput))
testOutput[, lev] <- t(apply(testOutput[, lev], 1, function(x) x/sum(x)))
}
else {
if (metric == "ROC" & !ctrl$classProbs)
stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
}
if (!is.null(wts))
testOutput$weights <- sample(wts, min(10, length(wts)))
testOutput$rowIndex <- sample(1:n, size = nrow(testOutput))
ctrl$summaryFunction(testOutput, lev, method)
}
It appears that 10 is the length of fake data caret passes to your summary function to evaluate it (make sure it is working properly?).
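To see what such a dry run looks like without caret, here is a minimal base R sketch that mimics the factor branch of the code above. The names `fake_check` and `my_summary` are made up for illustration, and the real function does more (class probabilities, weights, `rowIndex`).

```r
# Toy summary function with the (data, lev, model) signature caret expects.
my_summary <- function(data, lev = NULL, model = NULL) {
  c(Accuracy = mean(data$pred == data$obs), N = nrow(data))
}

# Hypothetical stand-in for the dry run: build at most 10 shuffled fake
# predictions/observations from the outcome's levels and call the summary.
fake_check <- function(y, summary_fun) {
  n <- length(y)
  values <- rep_len(levels(y), min(10, n))
  fake <- data.frame(pred = factor(sample(values), levels = levels(y)),
                     obs  = factor(sample(values), levels = levels(y)))
  summary_fun(fake, lev = levels(y), model = "dummy")
}

y <- factor(sample(c("yes", "no"), 500, replace = TRUE))
out <- fake_check(y, my_summary)
out[["N"]]  # 10, regardless of the 500 rows in y
```

If this reading is right, a browser() placed in the summary function stops at this dry run first, and continuing should reach the calls made on the actual held-out folds.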
If anyone can verify/explain better that this is what caret is actually doing, please post.
I am using allFit() in lme4 to automatically scan through possible optimizers, since the default optimizer usually doesn't converge in this situation. My code works fine when I run it line-by-line, but when I run it wrapped in a simple function, it gives different results.
I've looked at the output of the allFit call and it seems that when it's NOT inside the function, it returns a list of lmerModLmerTest objects as desired.
However, inside the function, it returns a list with the values simpleError, error, and condition. Why is it doing this?
I'm using RStudio, R 3.6, lme4 1.1-21, lmerTest 3.1-0.
UPDATE: The problem is that the update() method used by allFit cannot find the 'tt' data frame when re-fitting the models. I have put breakpoints into the code, and the data does exist in the function environment, so I don't understand why it can't be found...
UPDATE 2: It appears that if I change the assignment of the data to <<-, it works. This is dangerous, though, because it writes to the global environment and breaks functional programming practice, and I think it may fail when I try to parallelize. I am testing further... still open to suggestions!
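For what it's worth, the general scoping problem can be reproduced in base R with lm() standing in for the mixed model. This is only a sketch under that assumption: whether allFit fails for exactly this reason is not confirmed here, and `refit` and `fit_in_function` are made-up names.

```r
# Helper defined at top level, playing the role of code that re-fits a model.
refit <- function(model) update(model)

fit_in_function <- function() {
  sim_data_local <- mtcars  # exists only in this function's frame
  m <- lm(mpg ~ wt, data = sim_data_local)

  # update() re-evaluates the stored call from inside refit(), where
  # `sim_data_local` is not visible, so this is expected to fail.
  failed <- tryCatch(refit(m), error = function(e) e)

  # Evaluating the same call in the environment attached to the model's
  # formula (this function's frame) finds the data again.
  ok <- eval(getCall(m), envir = environment(formula(m)))

  list(failed = failed, ok = ok)
}

res <- fit_in_function()
inherits(res$failed, "error")
inherits(res$ok, "lm")
```

The stored call only holds the name of the data object, so where that call gets re-evaluated decides whether the data is found. This would also explain why assigning with <<- appears to help.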
Here is the code that works, not inside the function:
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0,.3)){
sig_b2 <- -1*iccs / (iccs - 1)
n <- nsub * nclust * narm
y <- rep_len(NA, n)
arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
clustid <- rep(1:(nclust*narm), each = nsub)
clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
clustRE <- rep(clustRElist, each = nsub)
sig_b2 <- rep(sig_b2, each = nclust*nsub)
error <- rnorm(n, mean = 0, sd = 1)
beta <- rep(betas, each = nclust*nsub)
linpred <- beta + clustRE + error
output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
return(output)
}
set.seed(2)
test_1 <- multi_arm_var_sim()
model_flex_1 <- lmer(linpred ~ arm + (1 + arm | clustid),
data = test_1)
diff_optims_1 <- allFit(model_flex_1, verbose = TRUE)
print(class(diff_optims_1[[1]]))
is.OK_1 <- sapply(diff_optims_1, is, "lmerMod")
print(is.OK_1)
And here is the code that doesn't work, same setup, wrapped in a function.
library(lme4)
multi_arm_var_sim <- function(nsub = 20, nclust = 100, narm = 2, iccs = c(.01, .04), betas = c(0,.3)){
sig_b2 <- -1*iccs / (iccs - 1)
n <- nsub * nclust * narm
y <- rep_len(NA, n)
arm <- as.factor(rep(0:(narm-1), each = nsub*nclust))
clustid <- rep(1:(nclust*narm), each = nsub)
clustRElist <- rnorm(narm*nclust, mean = 0, sd = rep(sqrt(sig_b2), each = nclust))
clustRE <- rep(clustRElist, each = nsub)
sig_b2 <- rep(sig_b2, each = nclust*nsub)
error <- rnorm(n, mean = 0, sd = 1)
beta <- rep(betas, each = nclust*nsub)
linpred <- beta + clustRE + error
output <- cbind.data.frame(arm, clustid, sig_b2, clustRE, linpred)
return(output)
}
get_pval <- function(){
tt <- multi_arm_var_sim()
model_flex <- lme4::lmer(linpred ~ arm + (1 + arm | clustid),
data = tt)
diff_optims <- lme4::allFit(model_flex, data = tt, verbose = TRUE)
print(class(diff_optims[[1]]))
is.OK <- sapply(diff_optims, is, "merMod")
print(is.OK)
}
set.seed(2)
get_pval()
Thanks!!
I want to write a function that will (hopefully) work in the raster calculator in the raster package. What I'm trying to do is regress each cell value against a vector of time. There are multiple examples of this, but I would like the method to first try one type of regression (GLS, controlling for AR1 residual errors) and, if that regression throws an error for some reason (perhaps there is no AR1 structure in the residuals), fall back to simple OLS regression.
I've written two functions for the regression. One for gls:
# function for calculating the trend, variability, SNR, and residuals for each pixel
## this function will control for AR1 structure in the residuals
funTrAR1 <- function(x, ...) {if (sum(is.na(x)) >= 1) { NA } else {
mod <- nlme::gls(x ~ Year, na = na.omit, method = "REML", verbose = TRUE,
correlation = corAR1(form = ~ Year, fixed = FALSE),
control = glsControl(tolerance = 1e-3, msTol = 1e-3, opt = c("nlminb", "optim"),
singular.ok = TRUE, maxIter = 1000, msMaxIter = 1000))
slope <- mod$coefficients[2]
names(slope) <- "Trend"
var <- sd(mod$residuals)
names(var) <- "Variability"
snr <- slope/var
names(snr) <- "SNR"
residuals <- c(stats::quantile(
mod$residuals, probs = seq(0,1,0.25),
na.rm = TRUE, names = TRUE, type = 8),
base::mean(mod$residuals, na.rm = TRUE))
names(residuals) <- c("P0", "P25", "P50", "P75", "P100", "AvgResid")
return(c(slope, var, snr, residuals))}
}
and for OLS:
# function for calculating the trend, variability, SNR, and residuals for each pixel
## this function performs simple OLS
funTrOLS <- function(x, ...) {if (sum(is.na(x)) >= 1) { NA } else {
mod <- lm(x ~ Year, na.action = na.omit)
slope <- mod$coefficients[2]
names(slope) <- "TrendOLS"
var <- sd(mod$residuals)
names(var) <- "VariabilityOLS"
snr <- slope/var
names(snr) <- "SNROLS"
residuals <- c(stats::quantile(
mod$residuals, probs = seq(0,1,0.25),
na.rm = TRUE, names = TRUE, type = 8),
base::mean(mod$residuals, na.rm = TRUE))
names(residuals) <- c("P0", "P25", "P50", "P75", "P100", "AvgResid")
return(c(slope, var, snr, residuals))}
}
I'm trying to wrap these in a tryCatch expression which can be passed to raster::calc
xReg <- tryCatch(
{
funTrAR1
},
error = function(e) {
## this should create a text file if a model throws an error
sink(paste0(inDir, "/Outputs/localOLSErrors.txt"), append = TRUE)
cat(paste0("Used OLS regression (grid-cell) for model: ", m, ". Scenario: ", t, ". Variable: ", v, ". Realisation/Ensemble: ", r, ". \n"))
sink()
## run the second regression function
funTrOLS
}
)
This function is then passed to raster::calc like so
cellResults <- calc(rasterStack, fun = xReg)
Which, if everything works, will produce a raster stack of the output variables.
However, for some of my datasets the loop that I'm running all of this in stops and I receive the following error:
Error in nlme::gls(x ~ Year, na = na.omit, method = "REML", verbose = TRUE, :
false convergence (8)
Which is directly from nlme::gls and what I was hoping to avoid. I've never used tryCatch before (this might be very obvious), but does anyone know how to get the tryCatch() to move to the second regression function if the first (AR1) regression fails?
Here is another way to code this, perhaps that helps:
xReg <- function(x, ...) {
r <- try(funTrAR1(x, ...), silent=TRUE)
# if (class(r) == 'try-error') {
if (!is.numeric(r)) { # perhaps a faster test than the one above
r <- c(funTrOLS(x, ...), 2)
} else {
r <- c(r, 1)
}
r
}
I add a layer that shows which model was used for each cell.
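The reason the original attempt never falls back is that `tryCatch({ funTrAR1 }, ...)` only evaluates the function object, which cannot fail; the error is raised later, when calc actually calls it on a cell. The fallback therefore has to wrap the call itself. A minimal sketch of the same pattern with toy stand-ins (`fit_primary`, `fit_fallback` and `with_fallback` are hypothetical names, not the regressions from the question):

```r
# Toy stand-ins for the two regressions: the primary fit fails on negative
# input, the fallback always succeeds.
fit_primary <- function(x) {
  if (any(x < 0)) stop("false convergence")
  c(Trend = mean(x))
}
fit_fallback <- function(x) c(Trend = stats::median(x))

# The fallback decision is made inside the function, once per call.
with_fallback <- function(x) {
  tryCatch(
    c(fit_primary(x), Model = 1),
    error = function(e) c(fit_fallback(x), Model = 2)
  )
}

with_fallback(c(1, 2, 3))[["Model"]]   # 1: primary succeeded
with_fallback(c(-1, 2, 3))[["Model"]]  # 2: fell back
```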
You can also do
xReg <- function(x, ...) {
r <- funTrOLS(x, ...)
try( r <- funTrAR1(x, ...), silent=TRUE)
r
}
Or use calc twice and use cover after that
xReg1 <- function(x, ...) {
r <- rep(NA, 9) # one NA per value returned by funTrAR1 (trend, variability, SNR, 6 residual summaries)
try( r <- funTrAR1(x, ...), silent=TRUE)
r
}
xReg2 <- function(x, ...) {
funTrOLS(x, ...)
}
a <- calc(rasterStack, xReg1)
b <- calc(rasterStack, xReg2)
d <- cover(a, b)
And a will show you where xReg1 failed.
After doing a bit more reading, and also looking at @RobertH's answer, I wrote a bit of (very) ugly code that checks whether GLS will fail and, if it does, performs OLS instead. I'm positive that there is a nicer way to do this, but it works and keeps the raster layer names as they were defined in my functions. It also exports any errors to a txt file.
for (i in 1) {
j <- tempCentredRas
cat(paste("Checking to see if gls(AR1) will work for model", m, r,"cell based calculations\n", sep = " "))
### This check is particularly annoying as it has to do this for every grid-cell
### it therefore has to perform GLS/OLS on every grid cell twice
### First to check if it (GLS) will fail, and then again if it does fail (use OLS) or doesn't (use GLS)
possibleLocalError <- tryCatch(
raster::calc(j, fun = funTrAR1),
error = function(err)
err
)
if (inherits(possibleLocalError, "error")) {
cat(paste("GLS regression failed for model", m, r, "using OLS instead for cell based results.","\n", sep = " "))
cellResults <- raster::calc(j, fun = funTrOLS)
} else {
cellResults <- raster::calc(j, fun = funTrAR1)
}
}
I have a question about running a parallel simulation with foreach. I would really appreciate your help, as I have spent hours on this and keep getting errors.
I'm using foreach() for a simulation study in which each iteration runs a function that itself contains some "regular" loops. When I run it sequentially, it works perfectly fine. On a Linux server, it also works fine when I run it with foreach in parallel inside R, but as soon as I run it in batch mode using the "nohup R CMD BATCH" command, it starts returning errors. I set the seed manually for each simulation, so whether I run in parallel inside R, in parallel in batch mode, or sequentially inside R, my code runs on the same simulated data. If one way produces no error, the others should not either, but strangely that is not the case. Has anyone had the same challenge?
I've spent hours adding exception handling, but nothing changed.
Here is my code:
library(foreach) # foreach() and %dopar%
library(MASS) # glmmPQL()
dataSim <- function(seedNum, n, mi, beta0FE, beta1, beta0RE){
# Setting the seed number:
set.seed(seedNum)
# Generating covariates:
x <- rnorm(n, mean = 0, sd = 1)
# Generating Y:
data <- data.frame(id = rep(1:n, each = mi), x = rep(x, each = mi),
beta0_RE = rep(beta0RE, each = mi))
etaTmp <- data$beta0_RE + beta0FE + beta1*data$x
piTmp <- exp(etaTmp)/(1 + exp(etaTmp))
data$Y <- sapply(piTmp, rbinom, n = 1, size = 1)
return(data)
}
# Data Simulation Parameters:
n <- 100
mi <- 30
beta0FE <- 0.3
beta1 <- 1
beta0RE <- rnorm(n, 0, 1)
# Simulation Parameters:
nSim <- 10000
nIter <- 4000
LME_Freq_Fun <- function(data){
fit.LME <- try(glmmPQL(Y ~ x, random = ~ 1 | id,
family = binomial, data = data))
print(class(fit.LME)[1])
if (class(fit.LME)[1] == "try-error"){
return(list(coef.FE.LME = rep(NA, 3),
coef.RE.LME = rep(NA, n)))
}else{
return(list(coef.FE.LME = fit.LME$coefficients$fixed,
coef.RE.LME = fit.LME$coefficients$random$id))
}
}
LME_Freq_RSLT <- foreach (i = 1:nSim) %dopar%{
print(paste("i=", i))
print("-------")
data <- dataSim(i, n, mi, beta0FE, beta1, beta0RE)
LME_Freq_Fun(data)
}
# Extracting elements from foreach:
Coef.Fixed.LME <- t(sapply(1:nSim ,function (i)
return(LME_Freq_RSLT[[i]][[1]] )))
Coef.Rand.LME <- t(sapply(1:nSim ,function (i)
return(LME_Freq_RSLT[[i]][[2]] )))
With no "try", I keep getting:
task 3527 failed - "$ operator is invalid for atomic vectors"
With try, my Coef.Fixed.LME becomes very strange, with all elements shown as "numeric.2"!
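One possible reading of that output, offered as an assumption rather than a diagnosis: the failure branch returns rep(NA, 3) for the fixed effects, while a successful fit of Y ~ x has two of them, so the per-simulation vectors have different lengths and sapply() can no longer simplify them to a numeric matrix. A base R sketch with made-up coefficient values:

```r
# Three fake "fixed effect" results: two successful fits with 2
# coefficients each and one failure padded with 3 NAs, as in rep(NA, 3).
ragged <- list(c(b0 = 0.3, b1 = 1.0), rep(NA, 3), c(b0 = 0.2, b1 = 0.9))
is.list(sapply(ragged, identity))  # TRUE: lengths differ, no matrix

# Padding failures to the same length as a successful fit (2 here)
# lets the results bind into a regular numeric matrix.
padded <- list(c(b0 = 0.3, b1 = 1.0), rep(NA_real_, 2), c(b0 = 0.2, b1 = 0.9))
coef_mat <- do.call(rbind, padded)
dim(coef_mat)  # 3 2
```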
Hi, I am trying to understand how to get DEoptim to work with parallel processing, but I am struggling to pass the correct parameters to the function. Below is a reproducible example (it has a financial context) designed to create a random portfolio of 7 assets and optimise it for ES. It was inspired by http://mpra.ub.uni-muenchen.de/28187/1/RJwrapper.pdf and also http://files.meetup.com/1772780/20120201_Ulrich_Parallel_DEoptim.pdf, the second of which does include a parallel option, but I want to use Unix forking rather than SOCK clusters.
require(quantmod)
require(PerformanceAnalytics)
require(DEoptim)
tickers <- c("^GSPC","^IXIC","^TNX", "DIA","USO","GLD","SLV","UNG","^VIX","F","^FTSE","GS","MS","MSFT","MCD","COKE","AAPL","GOOG","T","C","BHP","RIO","CMG")
getSymbols(tickers)
tickers <- gsub("\\^","",tickers)
x <- lapply(tickers, function(x){ClCl(get(x))})
comb <- na.omit(do.call(merge,x))
colnames(comb) <- paste0(tickers,".cc")
obj <- function(w) {
if (sum(w) == 0) { w <- w + 1e-2 }
w <- w / sum(w)
CVaR <- ES(weights = w,
method = "gaussian",
portfolio_method = "component",
mu = mu,
sigma = sigma)
tmp1 <- CVaR$ES
tmp2 <- max(CVaR$pct_contrib_ES - 0.225, 0)
out <- tmp1 + 1e3 * tmp2
}
comb1 <- comb[,sample(1:ncol(comb),7)]
no.of.assets <- ncol(comb1)
mu <- colMeans(comb1)
sigma <- cov(comb1)
## The non-parallel version
output <- DEoptim(fn = obj,lower = rep(0, no.of.assets), upper = rep(1, no.of.assets))
## The parallel version that doesn't seem to work...
output <- DEoptim(fn = obj,lower = rep(0, no.of.assets), upper = rep(1, no.of.assets), DEoptim.control(itermax=10000, trace=250, parallelType="parallel", packages=c("PerformanceAnalytics"), parVar=c("mu","sigma")))
I get the following error message:
Error in missing(packages) : 'missing' can only be used for arguments