Related
I am trying something pretty simple, want to run a bunch of regressions parallelly. When I use the following data generator (PART 1), The parallel part does not work and give the error listed below
#PART 1
p <- 20; rho<-0.7;
cdc<- diag(p)
for( i in 1:(p-1) ){ for( j in (i+1):p ){
cdc[i,j] <- cdc[j,i] <- rho^abs(i-j)
}}
my.data <- mvrnorm(n=100, mu = rep(0, p), Sigma = cdc)
The following Parallel Part does work but if I generate the data as PART 2
# PART 2
my.data<-matrix(rnorm(1000,0,1),nrow=100,ncol=10)
I configured the function that I want to run parallelly... as
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-coef(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
So I write the following way of running parlapply
{
no_cores <- detectCores(logical = TRUE)
myclusternumber<-(no_cores-1)
cl <- makeCluster(myclusternumber)
registerDoParallel(cl)
p1 <- ncol(my.data)
obj<-splitIndices(p1, myclusternumber)
clusterExport(cl,list('parallel_fun','my.data','obj'),envir=environment())
clusterEvalQ(cl, {
library(MASS)
library(Matrix)
library(BAS)
})
newresult<-parallel::parLapply(cl,obj,fun = parallel_fun,my.data)
stopCluster(cl)
}
But whenever am doing PART 1 I get the following error
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: object 'my_df' not found
But this should not happen, the data frame should be created, I have no idea why this is happening. Any help is appreciated.
Posting this as one possible workaround, see if it works:
parallel_fun<-function(obj,my.data){
p1 <- nrow(cov(my.data));store.beta<-matrix(0,p1,length(obj))
count<-1
for (itration in obj) {
my_df<-data.frame(my.data)
colnames(my_df)[itration] <- "y"
my_df <<- my_df
my.model<-bas.lm(y ~ ., data= my_df, alpha=3,
prior="ZS-null", force.heredity = FALSE, pivot = TRUE)
cf<-BAS:::coef.bas(my.model, estimator="MPM")
betas<-cf$postmean[-1]
store.beta[ -itration, count]<- betas
count<-count+1
}
result<-list('Beta'=store.beta)
}
The issue seems to be with BAS:::coef.bas function, that calls eval in order to get my_df and fails to do that when called in parallel. The "hack" here is to force my_df out to the parent environment by calling my_df <<- my_df.
There should be a better way to do this, but <<- might be the fastest one. In general, <<- may cause unwanted behaviour, especially when used in loops. Assigning unique variable name before exporting (and don't forgetting to remove after use) is one way to tackle them.
I'm running several models and functions with the models. I need to apply values from a list to a default R function, e.g., modelFit()
My idea is to run the same function to several models and compile the results after.
I'm trying to use loops and the Family of 'apply' functions in R, but with no success.
#package drc is necessary
library(drc)
#my data
rates <- c(.1,.1,.1,1,1,1,10,10,10,100,100,100,1000,1000,1000,.1,.1,.1,1,1,1,10,10,10,100,100,100,1000,1000,1000)
prod <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","B","B","B","B","B")
resp <- c(.295,.32,.301,.155,.1501,.148,.05,.03,.044,.002,.001,.0015,.001,.0005,.0003,.312,.337,.299,.265,.2501,.248,.102,.103,.114,.02,.01,.015,.003,.0002,.0007)
data.test <- data.frame(rates,prod,resp) #my data frame
#my models
m1 <- drm(resp~rates, fct=LL.4(), data=data.test[data.test$prod=="A",])
m2 <- drm(resp~rates, fct=LL.4(), data=data.test[data.test$prod=="B",])
#lack of fit test
modelFit(m1)
modelFit(m2)
#I can get the modelFit p-values this way:
modelFit(m1)$"p value"[2]
modelFit(m2)$"p value"[2]
#I have several models. I want to create a loop to give me only the p value for each model fit. I want to use that for other information given by the summaries and function
#list of models
modelsList <- c("m1","m2")
#I can print the strings with the loop
for(i in modelsList){
print(i)
}
My idea was to use the strings to add information for the default 'drc' function modelFit(). The result would be a list with all p values from all the results in the loop, but I'm getting the error:
Error: $ operator is invalid for atomic vectors
#Not working
for(i in modelsList){
modelFit(i)$"p value"[2]
}
#Error
# Error: $ operator is invalid for atomic vectors
#Trying to use lapply
#Following this logic
lapply(modelsList, function(x) print(x))
#I could not get the results
lapply(modelsList, function(x) modelFit(x)$"p value"[2])
With the results, I'd continue and I'd create a data.frame with all the models and respective p values.
good to see you, Danilo
I think you missed drm function.
modelsList <- c("m1","m2")
prod = c("A","B")
type = data.frame(modelsList,
prod)
#I can print the strings with the loop
for(i in modelsList){
print(i)
}
for(i in modelsList){
model_Prod = type[modelsList == i,]$prod
drm_result<-drm(resp~rates, fct=LL.4(), data=data.test[data.test$prod==model_Prod,])
print(modelFit(drm_result)$"p value"[2])
}
# lapply
lapply(modelsList, function(x) {
model_Prod = type[modelsList == x,]$prod
x<-drm(resp~rates, fct=LL.4(), data=data.test[data.test$prod==model_Prod,])
modelFit(x)$"p value"[2]
}
)
Continuing, after Steve Lee inputs, I could generate a data.frame with all results I got from the loop:
#source of information:
#https://stackoverflow.com/questions/25285570/invalid-factor-level-with-rbind-to-data-frame
#df.results will be a data.frane with headers = "Model" and "ModelFit_pvalue""
df.results <- NULL
#Loop to print only specif information requested from function and add it to a data frame
for(i in modelsList){
model_Prod = type[modelsList == i,]$prod
drm_result<-drm(resp~rates, fct=LL.4(), data=data.test[data.test$prod==model_Prod,])
#print(i)
#print(modelFit(drm_result)$"p value"[2])
modelpvalue <- modelFit(drm_result)$"p value"[2] #store p values in a variable
de <- data.frame("Model"=i,"ModelFit_pvalue"= modelpvalue)
df.results = rbind(df.results,de)
}
df.results
#df.results output:
Model ModelFit_pvalue
1 m1 0.2346460
2 m2 0.5757368
I've got some code that creates an object from a formula and saves the call for future use, as so:
create_obj <- function(formula, data) {
obj <- list()
# Populate obj with whatever we're interested in
# ...
# Save call for future use
obj$call <- match.call()
obj
}
obj <- create_obj(time ~ sex, data)
If I then bootstrap data I can easily build the model on the new dataset:
data <- data[sample(1:nrow(data), replace=T), ]
new_obj <- eval(obj$call)
However, if I have the formula saved in a variable and I pass the object into a new environment this won't work:
do_stuff <- function(object, newdata) {
data <- newdata[sample(1:nrow(newdata), replace=T), ]
new_object <- eval(object$call)
}
main <- function() {
my_form <- time ~ sex
obj2 <- create_obj(my_form, data)
# obj2$call: 'create_obj(formula = my_form, data = data)'
do_stuff(obj2, data)
}
Error: object my_form not found.
How can I have it so that obj$call saves time~sex rather than myform? Otherwise I need to pass the formula itself around rather than just the object, limiting the practicality.
The above example isn't reproducible but you can see the same thing with a standard lm call.
EDIT: I've solved the problem, see the accepted answer.
I've solved it by having the constructor function modify the call by evaluating the constant argument in the local environment:
create_obj <- function(formula, data) {
obj <- list()
# Populate obj with whatever we're interested in
# ...
# Save call for future use
func_call <- match.call()
func_call$formula <- eval(formula)
# obj_call is now: create_obj(formula=time~sex, data=data)
obj$call <- func_call
obj
}
I have a matrix of ts values and i need to save all garch models (models construct for each column af a matrix) in a list. I tried this code and it doesn't work (i can't understand why):
model = list()
for (i in ncol(p)) {
if (length(model) == 0) {
model = list(ugarchfit(data=p[-1,i], spec=spec))
} else {
bufer = ugarchfit(data=p[-1,i], spec=spec)
model = cbind(model, bufer)
}
}
Can somebody help me to cope with this? I need to address this list with the column index and get a model for this column index.
Thanks!
It is better the create the list with final dimensions rather that to create a growing list.
models <- vector("list", ncol(p))
Then overwrite the relevant element of the list
for (i in seq_len(ncol(p))) {
models[[i]] <- ugarchfit(data = p[-1, i], spec = spec)
}
Another solution would to use lapply
models <- lapply(
seq_len(ncol(p)),
function(i) {
ugarchfit(data = p[-1, i], spec = spec)
}
)
Note that your code will be more clear is you use = for passing arguments to a function and <- for assignments.
I am trying to apply a function I wrote that uses the 'pls' package to make a model and then use it
to predict several test set(in this case 9), returning the R2,RMSEP and prediction bias of each test set
for n number of subset selected from the data frame.
the function is
cpo<-function(data,newdata1,newdata2,newdata3,newdata4,newdata5,newdata6,newdata7,newdata8,newdata9){
data.pls<-plsr(protein~.,8,data=data,validation="LOO")#making a pls model
newdata1.pred<-predict(data.pls,8,newdata=newdata1) #using the model to predict test sets
newdata2.pred<-predict(data.pls,8,newdata=newdata2)
newdata3.pred<-predict(data.pls,8,newdata=newdata3)
newdata4.pred<-predict(data.pls,8,newdata=newdata4)
newdata5.pred<-predict(data.pls,8,newdata=newdata5)
newdata6.pred<-predict(data.pls,8,newdata=newdata6)
newdata7.pred<-predict(data.pls,8,newdata=newdata7)
newdata8.pred<-predict(data.pls,8,newdata=newdata8)
newdata9.pred<-predict(data.pls,8,newdata=newdata9)
pred.bias1<-mean(newdata1.pred-newdata1[742]) #calculating the prediction bias
pred.bias2<-mean(newdata2.pred-newdata2[742])
pred.bias3<-mean(newdata3.pred-newdata3[742]) #[742] reference values in column742
pred.bias4<-mean(newdata4.pred-newdata4[742])
pred.bias5<-mean(newdata5.pred-newdata5[742])
pred.bias6<-mean(newdata6.pred-newdata6[742])
pred.bias7<-mean(newdata7.pred-newdata7[742])
pred.bias8<-mean(newdata8.pred-newdata8[742])
pred.bias9<-mean(newdata9.pred-newdata9[742])
r<-c(R2(data.pls,"train"),RMSEP(data.pls,"train"),pred.bias1,
pred.bias2,pred.bias3,pred.bias4,pred.bias5,pred.bias6,
pred.bias7,pred.bias8,pred.bias9)
return(r)
}
selecting n number of subsets (based on an answer from my question[1]: Select several subsets by taking different row interval and appy function to all subsets
and applying cpo function to each subset I tried
Edited based on #Gavin advice
FO03 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets) # sub is the n number subsets created by selecting rows
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
for(i in sub) { #for every subset in sub i want to apply cpo
outList[[i]] <- cpo(data=sub,newdata1=gag11p,newdata2=gag12p,newdata3=gag13p,
newdata4=gag21p,newdata5=gag22p,newdata6=gag23p,
newdata7=gag31p,newdata8=gag32p,newdata9=gag33p) #new data are test sets loaded in the workspace
}
return(outlist)
}
FOO3(GAGp,10,10)
when I try this I keep getting 'Error in eval(expr, envir, enclos) : object 'protein' not found' not found.
Protein is used in the plsr formula of cpo, and is in the data set.
I then tried to use the plsr function directly as seen below
FOO4 <- function(data, nSubsets, nSkip){
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]
}
cal<-vector("list", length=nSubsets) #for each subset in sub make a pls model for protein
names(cal)<-c(paste("cal",1:nSubsets, sep=""))
for(i in sub) {
cal[[i]] <- plsr(protein~.,8,data=sub,validation="LOO")
}
return(outlist) # return is just used to end script and check if error still occurs
}
FOO4(gagpm,10,10)
When I tried this I get the same error 'Error in eval(expr, envir, enclos) : object 'protein' not found'.
Any advice on how to deal with this and make the function work will be much appreciated.
I suspect the problem is immediately at the start of FOO3():
FOO3 <- function(data, nSubsets, nSkip) {
outList <- vector("list", r <- c(R2(data.pls,"train"), RMSEP(data.pls,"train"),
pred.bias1, pred.bias2, pred.bias3, pred.bias4, pred.bias5,
pred.bias6, pred.bias7, pred.bias8, pred.bias9))
Not sure what you are trying to do when creating outList, but vector() has two arguments and you seem to be assigning to r a vector of numerics that you want R to use as the length argument to vector().
Here you are using the object data.pls and this doesn't exist yet - and never will in the frame of FOO3() - it is only ever created in cpo().
Your second loop looks totally wrong - you are not assigning the output from cpo() to anything. I suspect you wanted:
outList <- vector("list", 11)
names(outList) <- c("R2train","RMSEPtrain", paste("bias", 1:9, sep = ""))
....
for(i in subset) {
outList[[i]] <- cpo(....)
}
return(outList)
But that depends on what subset is etc. You also haven't got the syntax for this loop right. You have
for(i in(subset)) {
when it should be
for(i in subset) {
And subset and data aren't great names as these are common R functions and modelling arguments.
There are lots of problems with your code. Try to start simple and build up from there.
I have managed to achieved what i wanted using this, if there is a better way of doing it (i'm sure there must be) I'm eager to learn.This function preforms the following task
1. select "n" number of subsets from a dataframe
2. For each subset created, a plsr model is made
3. Each plsr model is used to predict 9 test sets
4. For each prediction, the prediction bias is calculated
far5<- function(data, nSubsets, nSkip){
sub <- vector("list", length = nSubsets)
names(sub) <- c( paste("sub", 1:nSubsets, sep = ""))
totRow <- nrow(data)
for (i in seq_len(nSubsets)) {
rowsToGrab <- seq(i, totRow, nSkip)
sub[[i]] <- data[rowsToGrab ,]} #sub is the subsets created
mop<- lapply(sub,cpr2) #assigning output from cpr to mop
names(mop)<-c(paste("mop", mop, sep=""))
return(names(mop))
}
call: far5(data,nSubsets, nSkip))
The first part -selecting the subsets is based on the answer to my question Select several subsets by taking different row interval and appy function to all subsets
I was then able to apply the function cpr2 to the subsets created using "lapply" instead of the "for' loop as was previously done.
cpr2 is a modification of cpo, for which only data is supplied, and the new data to be predicted is used directly in the function as shown below.
cpr2<-function(data){
data.pls<-plsr(protein~.,8,data=data,validation="LOO") #make plsr model
gag11p.pred<-predict(data.pls,8,newdata=gag11p) #predict each test set
gag12p.pred<-predict(data.pls,8,newdata=gag12p)
gag13p.pred<-predict(data.pls,8,newdata=gag13p)
gag21p.pred<-predict(data.pls,8,newdata=gag21p)
gag22p.pred<-predict(data.pls,8,newdata=gag22p)
gag23p.pred<-predict(data.pls,8,newdata=gag23p)
gag31p.pred<-predict(data.pls,8,newdata=gag31p)
gag32p.pred<-predict(data.pls,8,newdata=gag32p)
gag33p.pred<-predict(data.pls,8,newdata=gag33p)
pred.bias1<-mean(gag11p.pred-gag11p[742]) #calculate prediction bias
pred.bias2<-mean(gag12p.pred-gag12p[742])
pred.bias3<-mean(gag13p.pred-gag13p[742])
pred.bias4<-mean(gag21p.pred-gag21p[742])
pred.bias5<-mean(gag22p.pred-gag22p[742])
pred.bias6<-mean(gag23p.pred-gag23p[742])
pred.bias7<-mean(gag31p.pred-gag31p[742])
pred.bias8<-mean(gag32p.pred-gag32p[742])
pred.bias9<-mean(gag33p.pred-gag33p[742])
r<-signif(c(pred.bias1,pred.bias2,pred.bias3,pred.bias4,pred.bias5,
pred.bias6,pred.bias7,pred.bias8,pred.bias9),2)
out<-c(R2(data.pls,"train",ncomp=8),RMSEP(data.pls,"train",ncomp=8),r)
return(out)
} #signif use to return 2 decimal place for prediction bias
call:cpr2(data)
I was able to use this to solve my problem, however since the amount of new data to be predicted was only nine, it was possible to list them out as i did. If there is a more generalized way to do this I'm interested in learning.