I want to write a function that works in the raster calculator of the raster package. The aim is to regress each cell's values against a vector of time. There are multiple examples of this, but what I would like is for the method to try one type of regression first (gls, controlling for AR1 residual errors) and, if that regression throws an error (perhaps because there is no AR1 structure in the residuals), to fall back to simple OLS regression.
I've written two functions for the regression. One for gls:
# function for calculating the trend, variability, SNR, and residuals for each pixel
## this function controls for AR1 structure in the residuals
## Year is assumed to exist in the calling environment (one value per layer of the stack)
funTrAR1 <- function(x, ...) {
  if (anyNA(x)) {
    rep(NA, 9) # one NA per output value, so calc() always gets the same length
  } else {
    mod <- nlme::gls(x ~ Year, na.action = na.omit, method = "REML", verbose = TRUE,
                     correlation = nlme::corAR1(form = ~ Year, fixed = FALSE),
                     control = nlme::glsControl(tolerance = 1e-3, msTol = 1e-3,
                                                opt = c("nlminb", "optim"),
                                                singular.ok = TRUE,
                                                maxIter = 1000, msMaxIter = 1000))
    slope <- mod$coefficients[2]
    names(slope) <- "Trend"
    var <- sd(mod$residuals)
    names(var) <- "Variability"
    snr <- slope / var
    names(snr) <- "SNR"
    residuals <- c(stats::quantile(mod$residuals, probs = seq(0, 1, 0.25),
                                   na.rm = TRUE, names = TRUE, type = 8),
                   base::mean(mod$residuals, na.rm = TRUE))
    names(residuals) <- c("P0", "P25", "P50", "P75", "P100", "AvgResid")
    return(c(slope, var, snr, residuals))
  }
}
and for OLS:
# function for calculating the trend, variability, SNR, and residuals for each pixel
## this function performs simple OLS
funTrOLS <- function(x, ...) {
  if (anyNA(x)) {
    rep(NA, 9) # constant output length for calc()
  } else {
    mod <- lm(x ~ Year, na.action = na.omit)
    slope <- mod$coefficients[2]
    names(slope) <- "TrendOLS"
    var <- sd(mod$residuals)
    names(var) <- "VariabilityOLS"
    snr <- slope / var
    names(snr) <- "SNROLS"
    residuals <- c(stats::quantile(mod$residuals, probs = seq(0, 1, 0.25),
                                   na.rm = TRUE, names = TRUE, type = 8),
                   base::mean(mod$residuals, na.rm = TRUE))
    names(residuals) <- c("P0", "P25", "P50", "P75", "P100", "AvgResid")
    return(c(slope, var, snr, residuals))
  }
}
I'm trying to wrap these in a tryCatch expression which can be passed to raster::calc
xReg <- tryCatch(
{
funTrAR1
},
error = function(e) {
## this should create a text file if a model throws an error
sink(paste0(inDir, "/Outputs/localOLSErrors.txt"), append = TRUE)
cat(paste0("Used OLS regression (grid-cell) for model: ", m, ". Scenario: ", t, ". Variable: ", v, ". Realisation/Ensemble: ", r, ". \n"))
sink()
## run the second regression function
funTrOLS
}
)
This function is then passed to raster::calc like so
cellResults <- calc(rasterStack, fun = xReg)
If everything works, this will produce a raster stack of the output variables.
However, for some of my datasets the loop that I'm running all of this in stops and I receive the following error:
Error in nlme::gls(x ~ Year, na.action = na.omit, method = "REML", verbose = TRUE, :
false convergence (8)
This comes directly from nlme::gls and is exactly what I was hoping to avoid. I've never used tryCatch before (so this might be very obvious), but does anyone know how to get tryCatch() to move on to the second regression function when the first (AR1) regression fails?
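For reference, the tryCatch above can never reach its error handler: it only evaluates the function object funTrAR1, which cannot fail by itself, and the gls error is raised later inside calc(). A fallback has to wrap the call, once per cell. A minimal sketch of that idea (xRegTC is a name made up here):
## sketch: tryCatch must wrap the call, so it runs once per cell
xRegTC <- function(x, ...) {
  tryCatch(
    funTrAR1(x, ...),
    error = function(e) funTrOLS(x, ...)
  )
}
cellResults <- calc(rasterStack, fun = xRegTC)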
Here is another way to code this, perhaps that helps:
xReg <- function(x, ...) {
  r <- try(funTrAR1(x, ...), silent = TRUE)
  # if (class(r) == 'try-error') {
  if (!is.numeric(r)) { # perhaps a faster test than the one above
    r <- c(funTrOLS(x, ...), 2)
  } else {
    r <- c(r, 1)
  }
  r
}
I add a layer that shows which model was used for each cell (1 = AR1, 2 = OLS).
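The replacement is drop-in, so usage stays as in the question:
cellResults <- calc(rasterStack, fun = xReg)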
You can also do
xReg <- function(x, ...) {
  r <- funTrOLS(x, ...)
  try(r <- funTrAR1(x, ...), silent = TRUE)
  r
}
Or use calc twice and use cover after that:
xReg1 <- function(x, ...) {
  r <- rep(NA, 9) # one NA per output value of funTrAR1
  try(r <- funTrAR1(x, ...), silent = TRUE)
  r
}
xReg2 <- function(x, ...) {
  funTrOLS(x, ...)
}
a <- calc(rasterStack, xReg1)
b <- calc(rasterStack, xReg2)
d <- cover(a, b)
Here a will show you where xReg1 failed, and cover() fills those NA cells of a with the corresponding values from b.
After doing a bit more reading, and also looking at RobertH's answer, I wrote a bit of (very) ugly code that checks whether GLS will fail and, if it does, performs OLS instead. I'm positive there is a nicer way to do this, but it works, maintains the raster layer names as defined in my functions, and exports any errors to a txt file.
for (i in 1) {
  j <- tempCentredRas
  cat(paste("Checking to see if gls(AR1) will work for model", m, r, "cell based calculations\n", sep = " "))
  ### This check is particularly annoying, as it has to be done for every grid cell:
  ### GLS/OLS is therefore performed on every grid cell twice,
  ### first to check whether GLS fails, and then again with OLS (if it did) or GLS (if it didn't)
  possibleLocalError <- tryCatch(
    raster::calc(j, fun = funTrAR1),
    error = function(err) err
  )
  if (inherits(possibleLocalError, "error")) {
    cat(paste("GLS regression failed for model", m, r, "using OLS instead for cell based results.", "\n", sep = " "))
    cellResults <- raster::calc(j, fun = funTrOLS)
  } else {
    cellResults <- raster::calc(j, fun = funTrAR1)
  }
}
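Since possibleLocalError already holds the successful raster when no error occurred, one small tidy-up of the above (a sketch keeping the same variable names) avoids running the GLS calc() twice:
possibleLocalError <- tryCatch(
  raster::calc(j, fun = funTrAR1),
  error = function(err) err
)
if (inherits(possibleLocalError, "error")) {
  cellResults <- raster::calc(j, fun = funTrOLS)
} else {
  cellResults <- possibleLocalError # reuse the GLS result instead of recomputing it
}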
I have a function that uses stats::nls() internally to get parameter estimates for a non-linear model. However, the number of parameters I need the function to estimate is variable and determined by the user. How can I do this? This function is going into an R package, so it is important that it is as flexible as possible.
For example, here is some dummy data and two possible functions that users might use.
## dummy data
set.seed(654)
df <- data.frame(d = runif(50))
df$y <- exp(-df$d/.1)
df$x <- df$y + abs(rnorm(50, sd = .2))
## functions with different numbers of arguments
exp_fun <- function(d, r){
return(exp(-d/r))
}
exppow_fun <- function(d, r, a){
return(exp(-(d/r)^a))
}
The goal is to have a function called fit_nls(), which takes at least 3 arguments, x, d, and FUN. FUN is a function that takes d as its first argument, but can have any number of additional parameters, and outputs some transformation of d:
# run with any FUN
fit_nls(d = df$d, x = df$x, FUN = "exp_fun", ...)
fit_nls(d = df$d, x = df$x, FUN = "exppow_fun", ...)
fit_nls(d = df$d, x = df$x, FUN = function(d, a, b, c, e){...}, ...)
fit_nls(d = df$d, x = df$x, FUN = function(d){...}, ...)
I can make the function work for a fixed number of arguments:
fit_nls <- function(d, x, FUN, start){
  fit.fun <- match.fun(FUN)
  nls(x ~ fit.fun(d, r = r), start = start)
}
fit_nls(df$d, df$x, "exp_fun", start = list(r = .1))
fit_nls(df$d, df$x, function(d, r){d^r}, start = list(r = .1))
but haven't been able to figure out how to use a variable number of parameters. One thing I've tried is passing a list of arguments using do.call(), but this doesn't work:
fit_nls.multiarg <- function(d, x, FUN, start){
  fit.fun <- match.fun(FUN)
  args = append(list(d = d), start)
  nls(x ~ do.call(fit.fun, args), start = start)
}
fit_nls.multiarg(df$d, df$x, "exp_fun", list(r = .01)) # error
which isn't really surprising, since it is equivalent to setting the values within the function:
nls(df$x ~ exp_fun(df$d, r = .1), start = list(r = .01)) # equivalent error
So, I tried passing a symbol as a stand-in, without success:
fit_nls.symbol <- function(d, x, FUN, start){
  fit.fun <- match.fun(FUN)
  nam = names(start)
  args = append(list(d = d), as.symbol(nam))
  nls(x ~ do.call(fit.fun, args), start = start)
}
fit_nls.symbol(df$d, df$x, "exp_fun", list(r = .01)) # error
I'm open to any kind of solution. If anyone can give me any advice or point me in the right direction, I'd greatly appreciate it.
If FUN is passed as a function name or a character string, use its (deparsed) name directly in the formula; if it is passed as an anonymous function, assign it to fit.fun and refer to it by the name "fit.fun". Then create the formula as a character string, convert it to an actual R formula, and run nls.
fit_nls <- function(d, x, FUN, start) {
  # match.call()[[4]] is the FUN argument as supplied;
  # length > 1 means it is a call (an anonymous function), not a name or string
  FUN <- if (length(match.call()[[4]]) > 1) {
    fit.fun <- match.fun(FUN)
    "fit.fun"
  } else deparse(substitute(FUN))
  # build e.g. "x ~ exp_fun(d, r)" from the names of the start values
  fo <- as.formula(sprintf("x ~ %s(d, %s)", FUN, toString(names(start))))
  nls(fo, start = start)
}
Tests
with(df, fit_nls(d, x, "exp_fun", list(r = .01)))
## Nonlinear regression model
## model: x ~ exp_fun(d, r)
## data: parent.frame()
## r
## 0.1968
## residual sum-of-squares: 1.319
##
## Number of iterations to convergence: 11
## Achieved convergence tolerance: 6.254e-06
with(df, fit_nls(d, x, function(d, r){d^r}, start = list(r = .1)))
## Nonlinear regression model
## model: x ~ fit.fun(d, r)
## data: parent.frame()
## r
## 96.73
## residual sum-of-squares: 4.226
##
## Number of iterations to convergence: 22
## Achieved convergence tolerance: 7.429e-06
with(df, fit_nls(d, x, exp_fun, list(r = .01)))
## Nonlinear regression model
## model: x ~ exp_fun(d, r)
## data: parent.frame()
## r
## 0.1968
## residual sum-of-squares: 1.319
##
## Number of iterations to convergence: 11
## Achieved convergence tolerance: 6.254e-06
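The tests above only exercise one-parameter functions. The two-parameter function from the question should work the same way, since toString(names(start)) expands to "r, a"; a quick check (output not reproduced here):
with(df, fit_nls(d, x, "exppow_fun", start = list(r = 0.1, a = 1)))
## builds and fits: x ~ exppow_fun(d, r, a)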
I am using recursive feature elimination from the R package 'caret'
Linear regression works fine for my problem, therefore I am using functions = lmFuncs inside my control function.
But I would like to test this setup again without an intercept; is this possible?
My current code:
control <- rfeControl(functions = lmFuncs
, verbose = FALSE
)
results <- rfe(df_train
, df_train
, rfeControl=control
)
I would also go for a custom function, but I do not know how.
Many thanks in advance.
Edit:
I found the answer after having a deeper look into the caret package.
lmFuncs without Intercept:
lmFuncs_wo_intercept <- list(
summary = defaultSummary,
fit = function(x, y, first, last, ...) {
tmp <- if(is.data.frame(x)) x else as.data.frame(x, stringsAsFactors = TRUE)
tmp$y <- y
#lm(y~., data = tmp) #old
lm(y~0+., data = tmp) #new
},
pred = function(object, x) {
if(!is.data.frame(x)) x <- as.data.frame(x, stringsAsFactors = TRUE)
predict(object, x)
},
rank = function(object, x, y) {
coefs <- abs(coef(object))
#coefs <- coefs[names(coefs) != "(Intercept)"] # old
coefs[is.na(coefs)] <- 0
vimp <- data.frame(Overall = unname(coefs),
var = names(coefs))
rownames(vimp) <- names(coefs)
vimp <- vimp[order(vimp$Overall, decreasing = TRUE),, drop = FALSE]
vimp
},
selectSize = pickSizeBest,
selectVar = pickVars
)
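A minimal usage sketch of the custom list above (y_train is a made-up name standing in for whatever your response vector is):
control <- rfeControl(functions = lmFuncs_wo_intercept, verbose = FALSE)
results <- rfe(x = df_train, y = y_train, rfeControl = control)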
lmFuncs is your linear regression? In that case you could try to fit a second linear regression without the intercept and then apply the feature elimination function.
I'm tuning parameters with a custom summaryFunction in caret.
I originally thought that with K-fold cross-validation and N input data points, performance would be measured on N/K data points.
However, that appears not to be the case: when I extract data$pred (the data handed to the summary function) using browser(), it only has 10 rows.
Since the input (df) has over 500 data points, this is far smaller than I expected.
Why does it only have 10 rows? Is there any way to increase this (i.e. to test performance on more data points)?
Any help is appreciated. Thank you.
sigma.list <- seq(1, 5, 1)
c.list <- seq(1, 10, 1)
met <- "FValue"
#define evaluation function
eval <- function(data, lev = NULL, model = NULL){
  mat <- table(data$pred, data$obs)
  pre <- mat[1,1]/sum(mat[1,]) #precision
  rec <- mat[1,1]/sum(mat[,1]) #recall
  res <- c("Precision"=pre, "Recall"=rec, "FValue"=2*pre*rec/(pre+rec))
  browser()
  res
}
#define train control
## note: the original had a trailing comma after classProbs = TRUE, which is a syntax error in R
tc <- trainControl(method = "cv",
                   number = 5,
                   summaryFunction = eval,
                   classProbs = TRUE)
#tune with caret
svm.tune <- train(Flag~.,
                  data = df,
                  method = "svmRadial",
                  tuneGrid = expand.grid(C=c.list, sigma=sigma.list),
                  trControl = tc,
                  metric = met)
After tracking this down, it appears this is normal caret behavior.
I think caret is essentially verifying that your summaryFunction works properly by passing fake data (of length 10) to it. The function inside caret that does this is evalSummaryFunction.
I'm not quite sure what I'm doing in RStudio's debugger, but this code in train.default:
testSummary <- evalSummaryFunction(y, wts = weights,
ctrl = trControl, lev = classLevels, metric = metric,
method = method)
perfNames <- names(testSummary)
calls evalSummaryFunction, which looks like:
function (y, wts = NULL, perf = NULL, ctrl, lev, metric, method)
{
n <- if (class(y)[1] == "Surv")
nrow(y)
else length(y)
if (class(y)[1] != "Surv") {
if (is.factor(y)) {
values <- rep_len(levels(y), min(10, n))
pred_samp <- factor(sample(values), levels = lev)
obs_samp <- factor(sample(values), levels = lev)
}
else {
pred_samp <- sample(y, min(10, n))
obs_samp <- sample(y, min(10, n))
}
}
else {
pred_samp <- y[sample(1:n, min(10, n)), "time"]
obs_samp <- y[sample(1:n, min(10, n)), ]
}
testOutput <- data.frame(pred = pred_samp, obs = obs_samp)
if (!is.null(perf)) {
if (is.vector(perf))
stop("`perf` should be a data frame", call. = FALSE)
perf <- perf[sample(1:nrow(perf), nrow(testOutput)),
, drop = FALSE]
testOutput <- cbind(testOutput, perf)
}
if (ctrl$classProbs) {
for (i in seq(along = lev)) testOutput[, lev[i]] <- runif(nrow(testOutput))
testOutput[, lev] <- t(apply(testOutput[, lev], 1, function(x) x/sum(x)))
}
else {
if (metric == "ROC" & !ctrl$classProbs)
stop("train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()")
}
if (!is.null(wts))
testOutput$weights <- sample(wts, min(10, length(wts)))
testOutput$rowIndex <- sample(1:n, size = nrow(testOutput))
ctrl$summaryFunction(testOutput, lev, method)
}
It appears that 10 is the length of the fake data caret passes to your summary function to make sure it works properly. The actual calls during resampling should then receive the real holdout predictions (roughly N/K rows per fold).
If anyone can verify or explain better what caret is actually doing here, please post.
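One way to check this empirically, sketched here under the assumption that the explanation above is right, is to log the number of rows the summary function receives instead of calling browser() (eval_logged is a made-up name, same signature as eval() above):
eval_logged <- function(data, lev = NULL, model = NULL){
  cat("summaryFunction called with", nrow(data), "rows\n")
  mat <- table(data$pred, data$obs)
  pre <- mat[1,1]/sum(mat[1,]) #precision
  rec <- mat[1,1]/sum(mat[,1]) #recall
  c("Precision"=pre, "Recall"=rec, "FValue"=2*pre*rec/(pre+rec))
}
If the explanation holds, the first line printed should report 10 rows (the upfront check) and the later calls should report roughly nrow(df)/5 rows, one per CV fold.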
I would like to write an estimation function to estimate model parameters. I wrote my function without any errors, but optim does not work as expected. When I run my code line by line, my if condition raises warnings ("the condition has length > 1 and only the first element will be used"; in R >= 4.2 this is an error rather than a warning). So I think this is why optim does not work as expected, because when I run the optim call I get this:
Error in optim(par = start.parm, fn = t_LL, method = "L-BFGS-B", lower = low, :
object 'low' not found
However, I am not sure.
Here is my code:
library(VineCopula)
simdata <- BiCopSim(300, 5, -2)
## I call my function like this:
Myfun <- MLE(simdata, family = 5, par = -2, par2 = 0)
MLE <- function(data, family, par, par2) {
  n <- dim(data)[1]
  start.parm <- c(par, par2)
  if (family %in% c(3, 13)) {
    low <- 1e-04
    up <- 100
  } else if (family %in% c(4, 14)) {
    low <- 1.0001
    up <- 100
  } else if (family %in% c(5)) {
    low <- -100
    up <- 100
  }
  t_LL <- function(param, family, start, start2) {
    start <- param[[1]]
    start2 <- param[[2]]
    ll <- sum(log(BiCopPDF(data[, 1], data[, 2], family, start, start2)))
    return(ll)
  }
  optimout <- optim(par = start.parm,
                    fn = t_LL,
                    family = family,
                    start = start,
                    start2 = start2,
                    method = "L-BFGS-B",
                    lower = low,
                    upper = up,
                    control = list(fnscale = -1, maxit = 500))
  out <- list()
  out$par <- optimout$par[1]
  out$value <- optimout$value
  out
}
Any help, please?
The problem you're having in your real case is that: 1) you pass a vector argument to family; 2) only its first value gets used by the if block; 3) it doesn't match any of the values you check for; and 4) as a result, low and up never get assigned.
If you want to be able to pass a vector argument for family, take a look at ifelse to replace your if block. If not, throw an error when length(family) > 1, and add an else clause to your if block that throws an error when the given family doesn't match any of your choices, as sketched below.
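A minimal sketch of those guards, reusing the names from the question's code:
## fail fast instead of leaving 'low'/'up' undefined
if (length(family) > 1) stop("'family' must be a single value")
if (family %in% c(3, 13)) {
  low <- 1e-04
  up <- 100
} else if (family %in% c(4, 14)) {
  low <- 1.0001
  up <- 100
} else if (family == 5) {
  low <- -100
  up <- 100
} else {
  stop("unsupported 'family': ", family)
}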
Below is an MWE of my problem: I have programmed a progress bar for a function that runs a bootstrap (via the boot function from the boot package).
This works fine as long as I don't use parallel processing (res_1core below). If I use parallel processing by setting parallel = "multicore" and ncpus = 2, the progress bar isn't displayed properly (res_2core below).
library(boot)
rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
  env <- environment()
  counter <- 0
  progbar <- txtProgressBar(min = 0, max = R, style = 3)
  bootfun <- function(formula, data, indices) {
    d <- data[indices, ]
    fit <- lm(formula, data = d)
    curVal <- get("counter", envir = env)
    assign("counter", curVal + 1, envir = env)
    setTxtProgressBar(get("progbar", envir = env), curVal + 1)
    return(summary(fit)$r.square)
  }
  res <- boot(data = data, statistic = bootfun, R = R, formula = formula,
              parallel = parallel, ncpus = ncpus)
  return(res)
}
res_1core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000)
res_2core <- rsq(mpg ~ wt + disp, data = mtcars, R = 1000, parallel = "multicore", ncpus = 2)
I have read that this is related to the fact that boot calls lapply for single-core processing and mclapply for multicore processing. Does anyone know of an easy workaround? I would like the progress bar to reflect the progress of all the parallel processes together.
Update
Thanks to the input of Karolis Koncevičius, I have found a workaround (just use the updated rsq function below):
rsq <- function(formula, data, R, parallel = c("no", "multicore", "snow"), ncpus = 1) {
  bootfun <- function(formula, data, indices) {
    d <- data[indices, ]
    fit <- lm(formula, data = d)
    return(summary(fit)$r.square)
  }
  env <- environment()
  counter <- 0
  progbar <- txtProgressBar(min = 0, max = R, style = 3)
  flush.console()
  intfun <- function(formula, data, indices) {
    curVal <- get("counter", envir = env) + ncpus
    assign("counter", curVal, envir = env)
    setTxtProgressBar(get("progbar", envir = env), curVal)
    bootfun(formula, data, indices)
  }
  res <- boot(data = data, statistic = intfun, R = R, formula = formula,
              parallel = parallel, ncpus = ncpus)
  return(res)
}
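For reference on why the ncpus increment works: each forked worker inherits its own copy of env, so each counter only sees about R/ncpus calls; adding ncpus per call lets every worker's copy of the bar reach R. With 2 cores and R = 1000, for example, each worker makes 500 calls that each add 2, so its bar ends at 1000.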
Unfortunately, this only works for multicore processing when I run R from the terminal. Any ideas how to patch this so it also displays properly in the R console or RStudio?
Not exactly what you ordered, but might be helpful.
A simple statistics function to boot:
library(boot)
bootfun <- function(formula, data, indices) {
  d <- data[indices, ]
  fit <- lm(formula, data = d)
  summary(fit)$r.square
}
Higher order function to display progress:
progressReporter <- function(total, nBars = 100, f, ...) {
  count <- 1
  step <- ceiling(total / nBars)
  cat(paste(rep("|", nBars), collapse = ""), "\r")
  flush.console()
  function(...) {
    if (count %% step == 0) {
      cat(".")
    }
    count <<- count + 1
    f(...)
  }
}
Now this function cheats a little: it only displays progress every "step" iterations. If you have 1000 iterations, use two cores, and print every 10th iteration, it will do the job. The cores don't share state, but each will run its own counter up to 500, and the function responds to both counters.
On the other hand, if you do 1000 iterations, run 10 cores, and report every 200th iteration, the function will stay silent: each core only counts to 100, none reaches 200, and no progress bar appears. Hope you get the idea; I think it should be fine in most cases.
Try it out:
res_1core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun))
res_2core <- boot(formula="mpg ~ wt + disp", data=mtcars, R=1000, statistic=progressReporter(1000, nBars=100, f=bootfun), parallel="multicore", ncpus=2)
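With the two-core call above, each worker counts to about 500 while step is ceiling(1000/100) = 10, so each prints 50 dots and the two together still fill the 100-bar line, which is the behaviour described above.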