Predicting data from a gamlss model in a handler function using tryCatch in R

I am having a problem using the tryCatch() function in R in a function I created.
What I want to do is this:
1. simulate data based on the model results
2. analyze the simulated data using my gamlss model
3. use the predict function to extract model predictions over a new range of values
4. store these predictions in a data frame
5. do this many times
My main problem is that my model is somewhat unstable, and once in a while the simulated data are wild enough that gamlss throws an error when I analyze them. My objective is to write a tryCatch statement inside my simulation function that simply reruns the simulation/prediction code a second time if an error occurs. (I know this is not optimal; I could also wrap it in a repeat loop and rerun until I get no error, as sketched below, but I get few enough errors that the probability of two in a row is quite low, and I'm having enough trouble with this task as it is.)
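For reference, here is a minimal sketch of what I mean by the repeat-based retry; simulate_and_predict() is just a hypothetical placeholder for the simulate/fit/predict steps shown further down, not a function from my actual code:

preds <- NULL
repeat {
  # retry until one simulation/prediction run succeeds;
  # on error, tryCatch returns NULL and the loop runs again
  preds <- tryCatch(simulate_and_predict(), error = function(e) NULL)
  if (!is.null(preds)) break
}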
So I simplified my code as much as I could and created a dummy dataframe for which the modelling still works.
I have marked in the code where I believe the error occurs (at the predict() call, which does not find the mod_sim object). It is likely there, since the cat() just above that line prints while the one just below does not.
I think there are some things about how tryCatch works that I don't understand well enough, and I'm having a hard time understanding which objects are kept in which parts of the function and when they can or cannot be accessed...
Here is the code I have so far. The error occurs at l.84 (identified in the script). The data and code can be found here.
library(tidyverse)
library(gamlss)
library(gamlss.dist)
#Load data
load('DHT.RData')
#Run original model
mod_pred <- gamlss(harvest_total ~ ct,
                   data = DHT,
                   family = DPO)
#Function to compute predictions based on model
compute_CI_trad_gamlss <- function(n.sims = 200, mod){
  #DF for simulations
  df_sims <- as.data.frame(DHT)
  #Data frame with new data to predict over
  new.data.ct <<- expand.grid(ct = seq(from = 5, to = 32, length.out = 50))
  #Matrix to store predictions
  preds.sim.trad.ct <<- matrix(NA, nrow = nrow(new.data.ct), ncol = n.sims)
  #Number of obs to simulate
  n <- nrow(df_sims)
  #Simulation loop (simulate, analyze, predict, write result)
  for(i in 1:n.sims){
    #Put in tryCatch to deal with potential error on first run
    tryCatch({
      #Create matrix to store results of simulation
      y <- matrix(NA, n, 1)
      #In DF for simulations, create empty column to be filled with simulated data
      df_sims$sim_harvest <- NA
      #Loop to simulate observations
      for(t in 1:n){
        #Simulate data based on model parameters
        y[t] <- rDPO(n = 1, mu = mod$mu.fv[t], sigma = mod$sigma.fv[t])
      }#end of simulation loop
      #Paste the result of the simulation loop into the df_sims dataset
      df_sims$sim_harvest <- y
      #Analysis of simulated data
      mod_sim <- gamlss(sim_harvest ~ ct,
                        data = df_sims,
                        family = DPO)
      #Refit the model if convergence not attained
      if(mod_sim$converged == TRUE){
        #If converged do nothing
      } else {
        #If not converged refit model
        mod_sim <- refit(mod_sim)
      }
      cat('we make it to here!\n')
      #Store results in object
      ct <<- as.vector(predict(mod_sim, newdata = new.data.ct, type = 'response'))
      cat('but not to here :( \n')
      #If we made it down here, register err as 0 to be used in the if statement in the 'finally' code
      err <<- 0
    },
    #If an error occurs, register it and print it
    error = function(e) {
      #If error occurred, show it
      cat('error at', i, '\n')
      #Register err as 1 to be used in the if statement in the finally code below
      err <<- 1
    },
    finally = {
      if(err == 0){
        #If no error, do nothing and keep going outside of tryCatch
      }#End if err==0
      else if(err == 1){
        #If error, re-simulate data and do the analysis again
        y <- matrix(NA, n, 1)
        df_sims$sim_harvest <- NA
        #Loop to simulate observations
        for(t in 1:n){
          #Simulate data based on the model results
          y[t] <- rDPO(n = 1, mu = mod$mu.fv[t], sigma = mod$sigma.fv[t])
        }#end of simulation loop
        #Paste the result of the simulation loop into the df_sims dataset
        df_sims$sim_harvest <- y
        #Analysis of simulated data
        mod_sim <- gamlss(sim_harvest ~ ct,
                          data = df_sims,
                          family = DPO)
        cat('we also make it here \n')
        #Store results in object
        ct <<- as.vector(predict(mod_sim, newdata = new.data.ct, type = 'response'))
        cat('but not here... \n')
      }#End if err==1
    }#End finally
    )#End tryCatch
    #Write predictions for this iteration to the DF and start over
    preds.sim.trad.ct[, i] <<- ct
    #Show iteration number
    cat(i, '\n')
  }
  #Do some more stuff here
  #Return results
  return(preds = list(ct = list(predictions = preds.sim.trad.ct)))
}
#Run simulation and store object
result<-compute_CI_trad_gamlss(n.sims=20, mod=mod_pred)
Anyway I hope someone can help!
Thanks a lot!

So after a bit of trial and error I managed to make it work. I believe the problem lies in the mod_sim object not being saved to the global environment: predict() (or predict.gamlss() here) apparently does not look in the function's environment for the mod_sim object, although I don't understand why it wouldn't. In any case, using <<- (i.e. assigning to the global environment from inside the function) for every object created in the function seemed to do the trick. If anyone has an explanation for why this happens, I'd be glad to understand what I'm doing wrong!
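For anyone hitting the same issue, here is a minimal sketch of the workaround, showing only the relevant lines. My (unverified) understanding is that predict.gamlss() re-evaluates the original gamlss() call, so it looks for the data and the fitted object outside the function's local environment, which is why only globally assigned objects are found:

#Assign the simulated data and the fitted model globally so that
#predict.gamlss() can find them when it re-evaluates the model call
df_sims <<- df_sims
mod_sim <<- gamlss(sim_harvest ~ ct,
                   data = df_sims,
                   family = DPO)
ct <<- as.vector(predict(mod_sim, newdata = new.data.ct, type = 'response'))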

Related

Using a For Loop to Run Multiple Response Variables through a Train function to create multiple separate models in R

I am trying to create a for loop to index through each individual response variable I have and train a model using the train() function within the caret package. I have about 30 response variables and 43 predictor variables. I can train each model individually, but I would like to automate the process and have a for loop run through a model (I would eventually like to scale up to multiple models if possible, i.e. lm, rf, cubist, etc.). I then want to save each model to a data frame along with R-squared and RMSE values. The individual models that I currently have, and that run for me, are as follows, with column 11 being the response variable and columns 35-68 being the predictor variables.
data_Mg <- subset(data_3, !is.na(Mg))
mg.lm <- train(Mg~., data=data_Mg[,c(11,35:68)], method="lm", trControl=control)
mg.cubist <- train(Mg~., data=data_Mg[,c(11,35:68)], method="cubist", trControl=control)
mg.rf <- train(Mg~., data=data_Mg[,c(11,35:68)], method="rf", trControl=control, na.action = na.roughfix)
max(mg.lm$results$Rsquared)
min(mg.lm$results$RMSE)
max(mg.cubist$results$Rsquared)
min(mg.cubist$results$RMSE)
max(mg.rf$results$Rsquared) #Highest R squared
min(mg.rf$results$RMSE)
This gives me 3 models with all the relevant information that I need. Now for the for loop; I've only tried the lm model so far.
bucket <- list()
for(i in 1:ncol(data_4)) { # for-loop over response variables; needs to stop at the responses, right now it will run through all variables
  data_y <- subset(data_4, !is.na(i)) # get rid of NA's in the "i" column
  predictors_i <- colnames(data_4)[i] # Create vector of predictor names
  predictors_1.1 <- noquote(predictors_i)
  i.lm <- train(predictors_1.1~., data=data_4[,c(i,35:68)], method="lm", trControl=control)
  bucket <- i.lm
  #mod_summaries[[i - 1]] <- summary(lm(y ~ ., data_y[ , c("i.lm", predictors_i)]))
  #data_y <- data_4
}
Below is the error I am getting, with Bulk_Densi being the first variable in predictors_1.1. The error says that variable lengths differ, so I originally thought my issue was that quotes were being added around "Bulk_Densi", but after trying noquote() I have not gotten anywhere, so I am unsure where I am going wrong.
(screenshot of the error message)
Please let me know if I can provide any extra info and thanks in advance for the help! I've already tried the info in How to train several models within a loop for and was struggling with that as well.
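A sketch of one possible way to build the formula dynamically inside such a loop (untested; reformulate() builds e.g. Mg ~ . from the column name, and the 1:30 range for the response columns is an assumption, as is everything else not shown in the question):

bucket <- list()
for (i in 1:30) {                                   # assuming the first 30 columns are responses
  resp <- colnames(data_4)[i]                       # name of the i-th response variable
  data_y <- data_4[!is.na(data_4[[resp]]), ]        # drop rows with NA in that response
  f <- reformulate(".", response = resp)            # builds e.g. Mg ~ .
  bucket[[resp]] <- train(f, data = data_y[, c(i, 35:68)],
                          method = "lm", trControl = control)
}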

Save customized function inside function in MLFlow log_model

I would like to do something with MLFlow but I cannot find any solution on the Internet. I am working with MLFlow and R, and I want to save a regression model. The thing is that, when I predict the testing data, I first want to apply some transformations to that data. So I have:
data <- #some data with numeric regressors and dependent variable called 'y'
# Divide into train and test
ind <- sample(nrow(data), 0.8*nrow(data), replace = FALSE)
dataTrain <- data[ind,]
dataTest <- data[-ind,]
# Run model in the mlflow framework
with(mlflow_start_run(), {
  model <- lm(y ~ ., data = dataTrain)
  predict_fun <- function(model, data_to_predict){
    data_to_predict[,3] <- data_to_predict[,3]/2
    data_to_predict[,4] <- data_to_predict[,4] + 1
    return(predict(model, data_to_predict))
  }
  predictor <- crate(~predict_fun(model, dataTest), model)
  ### Some code to use the predictor to get the predictions and measure the accuracy as a log_metric
  ##################
  ##################
  ##################
  mlflow_log_model(predictor, 'model')
})
As you can see, my prediction function does not just call predict on the new data; it also transforms the third and fourth columns first. All the examples I have seen on the web use R's default predict function inside the crate.
Once I save this model and run it in another notebook with some test data, I get the error: "predict_fun" doesn't exist. That is because the saved model does not contain this specific function. Do you know what I can do to save a specific prediction function that I have created, instead of the default functions that come with R?
This is not the real example I am working with, but it is a good approximation of it. The point is that I want to save extra functions along with the model itself.
Thank you very much!
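A sketch of one possible way to bundle the custom function with the model (untested; it assumes that named objects passed to carrier::crate() are serialized together with the crated function, which is my understanding of how crate works):

#Bundle both the model and the custom prediction function into the crate so
#they are saved together; otherwise predict_fun is looked up in the global
#environment and is missing when the logged model is reloaded elsewhere
predictor <- carrier::crate(
  function(data_to_predict) predict_fun(model, data_to_predict),
  model = model,
  predict_fun = predict_fun
)
mlflow_log_model(predictor, 'model')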

'Invalid parent values' error when running JAGS from R

I am running a simple generalized linear model, calling JAGS from R. The model uses a negative binomial distribution. The model is being fitted to data on counts of fish, with the majority of individual counts ('C' in the data set below) being zeros.
I initially ran the model with one covariate, temperature ('Temp'). About half of the time the model ran and the other half of the time the model gave me the error, 'Error in node C[###] Invalid parent values.' The value for C[###] in the error message changes with each successive attempt to run the model.
Since my success at running the model was inconsistent, I tried adding another covariate, salinity ('Salt'). Then the model would not run at all, with the same error message as above.
Any ideas or suggestions on the source of the error are greatly appreciated.
I suspect that the initial values for the dispersion parameter, r, may be the issue. Ideally I would add several more covariates to the model once this error is addressed.
The data set and code are immediately below. For the sake of getting the data to load properly on this website, I have omitted 662 of the 672 total values; even with the reduced data set (n = 10 instead of n = 672) the problem remains.
Thank you.
setwd("C:/Users/John/Desktop")
library('coda')
library('rjags')
library('R2jags')
set.seed(1000000000)
#data
n=10
C=c(0,0,0,0,0,1,0,0,0,1)
Temp=c(0,29.3,25.3,28.7,28.7,24.4,25.1,25.1,24.2,23.3)
Salt=c(6,6,0,6,6,0,12,12,6,12)
sink("My Model.txt")
cat("
model {
r~dunif(0,10)
beta0~dunif (-20,20)
beta1~dunif (-20,20)
beta2~dunif (-20,20)
for (i in 1:n) {
C[i] ~ dnegbin(p[i], r)
p[i] <- r/(r+lambda[i])
log(lambda[i]) <- mu[i]
mu[i] <- beta0 + beta1*Temp[i] + beta2*Salt[i]
}
}
", fill=TRUE)
sink()
n=n
C=C
Temp=Temp
Salt=Salt
#bundle data
bugs.data = list(
"n",
"C",
"Temp",
"Salt")
#parameters to monitor
params<-c(
"r",
"beta0",
"beta1",
"beta2")
#initial values
inits <- function(){list(
r=floor(runif(1,0,5)),
beta0=runif(1,-5,5),
beta1=runif(1,-5,5),
beta2=runif(1,-5,5))}
model.file <- 'My Model.txt'
jagsfit <- jags(data=bugs.data, inits=inits, params, n.iter=1000, n.thin=10, n.burnin=100, model.file)
print(jagsfit, digits=5)
This works fine for me most of the time, but it would fail with the error you describe if the inits function samples a value of r of 0, which you have made more likely by using floor() in the inits function (not sure why you did that; r is not restricted to integers, but it is strictly positive). Also, every time you run the model you will get different initial values (unless you set a random seed in R), which is making your life more complicated than it needs to be. I generally recommend picking fixed (and probably overdispersed) initial values, such as r=0.01 and r=10 for the two chains in your example.
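For example, fixed per-chain initial values along those lines might look like this sketch (the beta values here are arbitrary; the only real constraint is keeping r strictly positive):

#Fixed, overdispersed initial values for two chains, with r > 0 in both
inits <- list(
  list(r = 0.01, beta0 = 0, beta1 = 0,  beta2 = 0),
  list(r = 10,   beta0 = 1, beta1 = -1, beta2 = 1)
)
jagsfit <- jags(data=bugs.data, inits=inits, params, n.chains=2,
                n.iter=1000, n.thin=10, n.burnin=100, model.file)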
However, JAGS picks usable initial values for this model itself, as you can see by not providing your own inits, e.g.:
library('runjags')
listdata <- lapply(bugs.data, get)
names(listdata) <- unlist(bugs.data)
run.jags(model.file, params, listdata)
I would also have a think about the prior you are using for r - it could well be that this will have a bigger effect on your posterior than intended. Another (not necessarily better) option is something like a gamma prior.
Matt

For loops regression in R

I'm fitting a GARCH model to the residuals of an ARIMA, and trying to apply ARCH(p) for p from 1 to 10 to compare the fits. Here is my code. Errors are returned in the for loop part, but I cannot figure out why. Could anyone give some tips?
For the single value p=1 the code is as below and there is no problem.
fitone<- garchFit(~garch(1,0),data=logprice)
coef(fitone)
summary(fitone)
And for the for loop my code goes like this:
for (n in 1:10) {
  fit[[n]] <- garchFit(~garch(n,0), data=logprice)
  coef(fit[[n]])
  summary(fit[[n]])
}
Error in .garchArgsParser(formula = formula, data = data, trace = FALSE) :
Formula and data units do not match.
I have never written a loop before. Can someone help me with the code?
The problem is that generally all the variables in a formula are evaluated in the context of the data= parameter, but your n variable isn't coming from logprice, it's coming from the global environment. You will need to create the formula dynamically. Here's one way to run all the models with lapply rather than a for loop:
library(fGarch)
#sample data
x.vec = as.vector(garchSim(garchSpec(rseed = 1985), n = 200)[,1])
fits <- lapply(1:10, function(n) {
  garchFit(bquote(~garch(.(n),0)), data = x.vec, trace = FALSE)
})
and then we can get the coefs with
lapply(fits, coef)
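If you prefer to keep the for loop, an equivalent (untested) sketch that builds the formula as a string should also work:

fit <- list()
for (n in 1:10) {
  # build the formula so garchFit sees the literal order, not the symbol 'n'
  f <- as.formula(paste0("~ garch(", n, ", 0)"))
  fit[[n]] <- garchFit(f, data = x.vec, trace = FALSE)
}
lapply(fit, coef)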

R - Saving Variable to dataframe from own function

again I'm stuck...
I want to write a function to get several statistics for checking the assumptions of a linear regression. The function I'm quoting here is not yet finished, but I think you'll get the point:
check.regression <- function(regmodel, dataframe, resplots = TRUE,
                             durbin = TRUE, savecheck = TRUE) {
  print(dwt(regmodel)) # Durbin-Watson test
  dataframe$stand.res <- rstandard(regmodel) # save standardized residuals
}
As you see, I want to save the standardized residuals of the model into the given dataframe.
regmodel refers to the model computed by the linear regression lm( y~x) and dataframe is the name of the dataframe from which the regression model is computed.
The problem is: nothing is saved within my function. If I do the command without the function, the residuals are properly saved into my dataframe.
I guess there has to be something like
save(dataframe$stand.res <- rstandard(regmodel))
just as I have to explicitly plot or print things to the console from within a function, but I don't know what that command might be.
Any ideas?
R uses pass-by-value semantics, so what is sent to the function is a copy of your data.frame (roughly speaking; I'm glossing over some details).
So when you call the function, you need to 1) return the modified data.frame and 2) assign it or you will lose the results.
check.regression <- function(regmodel, dataframe, resplots = TRUE,
                             durbin = TRUE, savecheck = TRUE) {
  print(dwt(regmodel)) # Durbin-Watson test
  dataframe$stand.res <- rstandard(regmodel) # save standardized residuals
  return(dataframe)
}
dataframe <- check.regression(regmodel, dataframe)
