R: Cannot find user defined function (nested design) - r

Please bear in mind that this is only my second day writing R code instead of using it, and I'm taking on a project almost surely above my level. A lot of my code is probably inefficient.
I'm trying to write R code which will automate the majority of my multiple regression analysis, while still allowing manual fine tuning in terms of the # of predictors, data transformations and model assumptions. I always get the error:
Error: could not find function "Dat.assumptn"
General advice on getting this nested function design to work is appreciated. Also, could someone please post a few well written links on functions in R which cover a range of difficulty?
As for my other issues, such as implementing pass by reference behavior via a package like R.oo from CRAN or a source from R: Pass by reference, I think I can figure it out. Here is a part of my code (incomplete and needs rewriting):
Dat.assumptn <- function(f, final, caperDS, picanDS) {
print(f)
crunchMod <- crunch(f, data = contrasts)
print(caic.table(crunchMod))
print(caic.diagnostics(crunchMod))
print(summary(crunchMod))
#If independent contrasts assumptions fail, return me to the second for loop
#within Dat.analysis() [Not Yet Implemented]
#Implement code to reduce and check the model (whether final = true/false)
if (final == TRUE) {
retry <- Dat.Msucess(crunchMod)
#The above function will recommend additional transformations if the final
#reduced model significantly violated model assumptions.
}
}
Dat.analysis <- function() {
treList <- dir(pattern="*.tre") //All of my phylogenetic tree files
caperDS <- read.table("dataSet.txt", header = TRUE)
picanDS <- read.table("dataSet.txt", row.names = 1, header = TRUE)
#Dat.assumptn() requires a different format from Dat.analysis()
#The loop below changes the names from my data set to be proper variable names
for (i in 1:length(names(picanDS))) {
varName <- gsub("_|[0-9]|\\.", "", names(picanDS)[i])
names(caperDS)[i+1] <- varName
names(picanDS)[i] <- varName
caperDS[,paste(varName,"2",sep="")] <- caperDS[i+1]*caperDS[i+1]
}
#Implement a for loop to transform the data based upon specifications from both
#Dat.assumptn() [called from Dat.analysis] and Dat.Msuccess [called from Dat.assumptn].
#Likely using pass by reference.
for (i in 1:length(treList)) {
myTrees = read.nexus(treList[i])
for (j in 1:length(myTrees)) {
cat(paste("\n\n", treList[i]))
print(multiPhylosignal(picanDS, myTrees[[j]]))
contrasts <- comparative.data(myTrees[[j]], caperDS, Species)
if (names(caperDS)[3] == "MedF" || names(caperDS)[3] == "MaxF") {
final <- FALSE
f <- as.formula(paste(paste(names(caperDS)[2],"~"),
paste(paste(paste("(",paste(names(caperDS)[4:(ncol(picanDS)+1)], collapse="+"))),")^2"),
paste("+", paste(names(caperDS)[(ncol(picanDS)+4):ncol(caperDS)], collapse = "+"))))
while (final == FALSE) {
f <- Dat.assumptn(f, final, caperDS, picanDS)
#Pass final by reference, and set to true if the final reduced model
#is achieved. Otherwise, iterate to reduce the model.
}
final <- FALSE
f <- as.formula(paste(paste(names(caperDS)[3],"~"),
paste(paste(paste("(",paste(names(caperDS)[4:(ncol(picanDS)+1)], collapse="+"))),")^2"),
paste("+", paste(names(caperDS)[(ncol(picanDS)+4):ncol(caperDS)], collapse = "+"))))
while (final == FALSE) {
f <- Dat.assumptn(f, final, caperDS, picanDS)
#Pass final by reference, and set to true if the final reduced model
#is achieved. Otherwise, iterate to reduce the model.
}
} else {
final <- FALSE
f <- as.formula(paste(paste(names(caperDS)[2],"~"),
paste(paste(paste("(",paste(names(caperDS)[3:(ncol(picanDS)+1)], collapse="+"))),")^2"),
paste("+", paste(names(caperDS)[(ncol(picanDS)+3):ncol(caperDS)], collapse = "+"))))
while (final == FALSE) {
f <- Dat.assumptn(f, final, caperDS, picanDS)
#Pass final by reference, and set to true if the final reduced model
#is achieved. Otherwise, iterate to reduce the model.
}
}
}
}
}

Related

Data frame creation inside Parlapply in R

I am trying something pretty simple, want to run a bunch of regressions parallelly. When I use the following data generator (PART 1), The parallel part does not work and give the error listed below
#PART 1
p <- 20; rho<-0.7;
cdc<- diag(p)
for( i in 1:(p-1) ){ for( j in (i+1):p ){
cdc[i,j] <- cdc[j,i] <- rho^abs(i-j)
}}
my.data <- mvrnorm(n=100, mu = rep(0, p), Sigma = cdc)
The following Parallel Part does work but if I generate the data as PART 2
# PART 2
my.data<-matrix(rnorm(1000,0,1),nrow=100,ncol=10)
I configured the function that I want to run parallelly... as
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-coef(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
So I write the following way of running parlapply
{
no_cores <- detectCores(logical = TRUE)
myclusternumber<-(no_cores-1)
cl <- makeCluster(myclusternumber)
registerDoParallel(cl)
p1 <- ncol(my.data)
obj<-splitIndices(p1, myclusternumber)
clusterExport(cl,list('parallel_fun','my.data','obj'),envir=environment())
clusterEvalQ(cl, {
library(MASS)
library(Matrix)
library(BAS)
})
newresult<-parallel::parLapply(cl,obj,fun = parallel_fun,my.data)
stopCluster(cl)
}
But whenever am doing PART 1 I get the following error
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: object 'my_df' not found
But this should not happen, the data frame should be created, I have no idea why this is happening. Any help is appreciated.
Posting this as one possible workaround, see if it works:
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my_df <<- my_df
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-BAS:::coef.bas(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
The issue seems to be with BAS:::coef.bas function, that calls eval in order to get my_df and fails to do that when called in parallel. The "hack" here is to force my_df out to the parent environment by calling my_df <<- my_df.
There should be a better way to do this, but <<- might be the fastest one. In general, <<- may cause unwanted behaviour, especially when used in loops. Assigning unique variable name before exporting (and don't forgetting to remove after use) is one way to tackle them.

Appending functions results within a for loop

Not a reproducible example, but here is my problem. I have a data frame and am looping through a series of columns and performing forecasting. I've created a fuction that will handle the forecasting part as it's repetitive and place it within the for loop. So each iteration of the loop, the forecast_func loop will be used to perform the forecast and accuracy assessment, and rbind those results to the results data frame.
result = data.frame()
forecast_func = function(data){
data.train = rnorm(100)
data.test = rnorm(10)
library(forecast)
mod = auto.arima(data.train)
mod_forc = forecast(mod)$mean
mod_acc = accuracy(mod_forc, data.test)
result = rbind(result, data.frame(mod_forc))
}
result
for(i in ...){
...
forecast_func(data)
...
}
If I were doing this once, I know how to append the results to result. However, within a for loop, every time I use the function, I get an empty data frame. Obviously, it's not returning to the global environment. However, I was wondering what were some useful strategies for attaching this problem.
I believe you can use the parent assignment operator here (<<-):
forecast_func = function(data){
data.train = rnorm(100)
data.test = rnorm(10)
library(forecast)
mod = auto.arima(data.train)
mod_forc = forecast(mod)$mean
mod_acc = accuracy(mod_forc, data.test)
# assign to 'result' data frame in the parent (calling) scope
result <<- rbind(result, data.frame(mod_forc))
}
result = data.frame()
for (i in ...) {
...
forecast_func(data)
...
}
This being said, using the parent assignment operator is usually frowned upon, based on what I have seen here on SO and elsewhere.
One way to avoid having use <<- would be to refactor your forecast_func to return a data frame, instead of trying to persist the result internally:
forecast_func = function(data){
data.train = rnorm(100)
data.test = rnorm(10)
library(forecast)
mod = auto.arima(data.train)
mod_forc = forecast(mod)$mean
mod_acc = accuracy(mod_forc, data.test)
return(data.frame(mod_forc))
}
result = data.frame()
for (i in ...) {
...
# just rbind() here in the calling scope
result = rbind(result, forecast_func(data))
...
}

Saving more then 1 models in a list in R

I have a matrix of ts values and i need to save all garch models (models construct for each column af a matrix) in a list. I tried this code and it doesn't work (i can't understand why):
model = list()
for (i in ncol(p)) {
if (length(model) == 0) {
model = list(ugarchfit(data=p[-1,i], spec=spec))
} else {
bufer = ugarchfit(data=p[-1,i], spec=spec)
model = cbind(model, bufer)
}
}
Can somebody help me to cope with this? I need to address this list with the column index and get a model for this column index.
Thanks!
It is better the create the list with final dimensions rather that to create a growing list.
models <- vector("list", ncol(p))
Then overwrite the relevant element of the list
for (i in seq_len(ncol(p))) {
models[[i]] <- ugarchfit(data = p[-1, i], spec = spec)
}
Another solution would to use lapply
models <- lapply(
seq_len(ncol(p)),
function(i) {
ugarchfit(data = p[-1, i], spec = spec)
}
)
Note that your code will be more clear is you use = for passing arguments to a function and <- for assignments.

R: make pls calibration models from n number of subset and use them to predict different test sets

I am trying to apply a function I wrote that uses the 'pls' package to make a model and then use it
to predict several test set(in this case 9), returning the R2,RMSEP and prediction bias of each test set
for n number of subset selected from the data frame.
the function is
cpo<-function(data,newdata1,newdata2,newdata3,newdata4,newdata5,newdata6,newdata7,newdata8,newdata9){
data.pls<-plsr(protein~.,8,data=data,validation="LOO")#making a pls model
newdata1.pred<-predict(data.pls,8,newdata=newdata1) #using the model to predict test sets
newdata2.pred<-predict(data.pls,8,newdata=newdata2)
newdata3.pred<-predict(data.pls,8,newdata=newdata3)
newdata4.pred<-predict(data.pls,8,newdata=newdata4)
newdata5.pred<-predict(data.pls,8,newdata=newdata5)
newdata6.pred<-predict(data.pls,8,newdata=newdata6)
newdata7.pred<-predict(data.pls,8,newdata=newdata7)
newdata8.pred<-predict(data.pls,8,newdata=newdata8)
newdata9.pred<-predict(data.pls,8,newdata=newdata9)
pred.bias1<-mean(newdata1.pred-newdata1[742]) #calculating the prediction bias
pred.bias2<-mean(newdata2.pred-newdata2[742])
pred.bias3<-mean(newdata3.pred-newdata3[742]) #[742] reference values in column742
pred.bias4<-mean(newdata4.pred-newdata4[742])
pred.bias5<-mean(newdata5.pred-newdata5[742])
pred.bias6<-mean(newdata6.pred-newdata6[742])
pred.bias7<-mean(newdata7.pred-newdata7[742])
pred.bias8<-mean(newdata8.pred-newdata8[742])
pred.bias9<-mean(newdata9.pred-newdata9[742])
r<-c(R2(data.pls,"train"),RMSEP(data.pls,"train"),pred.bias1,
pred.bias2,pred.bias3,pred.bias4,pred.bias5,pred.bias6,
pred.bias7,pred.bias8,pred.bias9)
return(r)
}
selecting n number of subsets (based on an answer from my question[1]: Select several subsets by taking different row interval and appy function to all subsets
and applying cpo function to each subset I tried
Edited based on #Gavin advice
FO03 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets) # sub is the n number subsets created by selecting rows
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
for(i in sub) { #for every subset in sub i want to apply cpo
outList[[i]] <- cpo(data=sub,newdata1=gag11p,newdata2=gag12p,newdata3=gag13p,
newdata4=gag21p,newdata5=gag22p,newdata6=gag23p,
newdata7=gag31p,newdata8=gag32p,newdata9=gag33p) #new data are test sets loaded in the workspace
}
return(outlist)
}
FOO3(GAGp,10,10)
when I try this I keep getting 'Error in eval(expr, envir, enclos) : object 'protein' not found' not found.
Protein is used in the plsr formula of cpo, and is in the data set.
I then tried to use the plsr function directly as seen below
FOO4 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
cal<-vector("list", length=nSubsets) #for each subset in sub make a pls model for protein
names(cal)<-c(paste("cal",1:nSubsets, sep=""))
for(i in sub) {
cal[[i]] <- plsr(protein~.,8,data=sub,validation="LOO")
}
return(outlist) # return is just used to end script and check if error still occurs
}
FOO4(gagpm,10,10)
When I tried this I get the same error 'Error in eval(expr, envir, enclos) : object 'protein' not found'.
Any advice on how to deal with this and make the function work will be much appreciated.
I suspect the problem is immediately at the start of FOO3():
FOO3 <- function(data, nSubsets, nSkip) {
outList <- vector("list", r <- c(R2(data.pls,"train"), RMSEP(data.pls,"train"),
pred.bias1, pred.bias2, pred.bias3, pred.bias4, pred.bias5,
pred.bias6, pred.bias7, pred.bias8, pred.bias9))
Not sure what you are trying to do when creating outList, but vector() has two arguments and you seem to be assigning to r a vector of numerics that you want R to use as the length argument to vector().
Here you are using the object data.pls and this doesn't exist yet - and never will in the frame of FOO3() - it is only ever created in cpo().
Your second loop looks totally wrong - you are not assigning the output from cpo() to anything. I suspect you wanted:
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
....
for(i in subset) {
outList[[i]] <- cpo(....)
}
return(outList)
But that depends on what subset is etc. You also haven't got the syntax for this loop right. You have
for(i in(subset)) {
when it should be
for(i in subset) {
And subset and data aren't great names as these are common R functions and modelling arguments.
There are lots of problems with your code. Try to start simple and build up from there.
I have managed to achieved what i wanted using this, if there is a better way of doing it (i'm sure there must be) I'm eager to learn.This function preforms the following task
1. select "n" number of subsets from a dataframe
2. For each subset created, a plsr model is made
3. Each plsr model is used to predict 9 test sets
4. For each prediction, the prediction bias is calculated
far5<- function(data, nSubsets, nSkip){
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]} #sub is the subsets created
mop<- lapply(sub,cpr2) #assigning output from cpr to mop
names(mop)<-c(paste("mop", mop, sep=""))
return(names(mop))
}
call: far5(data,nSubsets, nSkip))
The first part -selecting the subsets is based on the answer to my question Select several subsets by taking different row interval and appy function to all subsets
I was then able to apply the function cpr2 to the subsets created using "lapply" instead of the "for' loop as was previously done.
cpr2 is a modification of cpo, for which only data is supplied, and the new data to be predicted is used directly in the function as shown below.
cpr2<-function(data){
data.pls<-plsr(protein~.,8,data=data,validation="LOO") #make plsr model
gag11p.pred<-predict(data.pls,8,newdata=gag11p) #predict each test set
gag12p.pred<-predict(data.pls,8,newdata=gag12p)
gag13p.pred<-predict(data.pls,8,newdata=gag13p)
gag21p.pred<-predict(data.pls,8,newdata=gag21p)
gag22p.pred<-predict(data.pls,8,newdata=gag22p)
gag23p.pred<-predict(data.pls,8,newdata=gag23p)
gag31p.pred<-predict(data.pls,8,newdata=gag31p)
gag32p.pred<-predict(data.pls,8,newdata=gag32p)
gag33p.pred<-predict(data.pls,8,newdata=gag33p)
pred.bias1<-mean(gag11p.pred-gag11p[742]) #calculate prediction bias
pred.bias2<-mean(gag12p.pred-gag12p[742])
pred.bias3<-mean(gag13p.pred-gag13p[742])
pred.bias4<-mean(gag21p.pred-gag21p[742])
pred.bias5<-mean(gag22p.pred-gag22p[742])
pred.bias6<-mean(gag23p.pred-gag23p[742])
pred.bias7<-mean(gag31p.pred-gag31p[742])
pred.bias8<-mean(gag32p.pred-gag32p[742])
pred.bias9<-mean(gag33p.pred-gag33p[742])
r<-signif(c(pred.bias1,pred.bias2,pred.bias3,pred.bias4,pred.bias5,
pred.bias6,pred.bias7,pred.bias8,pred.bias9),2)
out<-c(R2(data.pls,"train",ncomp=8),RMSEP(data.pls,"train",ncomp=8),r)
return(out)
} #signif use to return 2 decimal place for prediction bias
call:cpr2(data)
I was able to use this to solve my problem, however since the amount of new data to be predicted was only nine, it was possible to list them out as i did. If there is a more generalized way to do this I'm interested in learning.

selection of nlm starting values problem

Need to estimate two parameters using the nlm function;
fit<-nlm(hood2par,c(x01[i],x02[j]),iterlim=300, catch=x[,c(3,4,5)],sp=.5)
where hood2par is a modified logistic
The convergence of nlm depends on the starting values ​​of these parameters. To find such initial values I ​​automatically generate two vectors of starting values
x01 = seq(-10,-20,-0.1)
x02 = seq(0.1,0.9,0.01)
next I create a routine included in a double for() to find the values ​​that lead to the convergence of the function:
for (i in 1:length(x01)) { for (j in 1:length(x02)) {
fit <- NULL
try(fit <- nlm(hood2par, c(x01[i],x02[j]), iterlim = 300, catch = x[,c(3,4,5)],
sp = .5),
silent = TRUE)
stopifnot(is.null(fit))}}
The problem I have is that when I include the previous routine in a function:
FFF <- function(x01, x02, catch){
for (i in 1:length(x01)) {
for (j in 1:length(x02)) {
fit <- NULL
try(fit <- nlm(hood2par, c(x01[i], x02[j]), iterlim = 300,
catch = x[,c(3,4,5)], sp = .5),
silent = TRUE) # does not stop in the case of err
stopifnot(is.null(fit))
}
}
return(fit)
}
I can´t get the 'fit' values from FFF():
> fit.fff<-FFF(x01,x02,catch)
#Error: is.null(fit) is not TRUE
>fit.fff
fit.fff
Error: object 'fit.fff' not found
I used stopifnot(is.null(fit)) to stop the loops when fit is not NULL (as fit is defined as a NULL object before try(...)). Regarding the try code you have shared, I just need this;
res <- try(some_expression)
if(inherits(res, "try-error"))
{
#some code to keep loops running
} else
{
#stop the loops and gather "res"
}
I tried to include the break function in the second argument of the condictional, but it doesn´t run in my R version...Any idea??
When you call FFF, inside the try block if nlm successfully completes, then fit is assigned, and the stopifnot condition is activated, throwing an error.
Wildly guessing, did you mean
stopifnot(!is.null(fit))
For future reference, a standard chunk of code for use with try is
res <- try(some_expression)
if(inherits(res, "try-error"))
{
#some error handling code
} else
{
#normal execution
}

Resources