I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model do dev/model
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, hypergrid$target, desc(hypergrid$R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind using this appraoch and I would be careful by conducting your analysis like this.
I'm trying to fit a von Bertalanffy growth function (VGBF) in r to my data grouped by a serial number.
This is a snippet of my data:
Serial_No<- c(315,315,315,315,315,315,315,316,316,316,316,317,317,317,317,317,317,317,317,317,318,318,318,318,319,319,319,319)
Year<-c(1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945)
tl<-c(19,33,46,55,63,66,70,22,39,55,65,20,40,51,56,60,62,63,64,65,26,43,54,60,28,43,53,61)
age<-c(1,2,3,4,5,6,7,1,2,3,4,1,2,3,4,5,6,7,8,9,1,2,3,4,1,2,3,4))
df<-data.frame(Serial_No, Year, tl, age)
I've been following this example: https://www.r-bloggers.com/2020/01/von-bertalanffy-growth-plots-ii/
and have changed my code to as follows:
vb <- vbFuns()
predict2 <- function(x) predict(x,data.frame(age=ages))
agesum <- group_by(df,Serial_No) %>%
summarize(minage=min(age),maxage=max(age))
Serial_Nos <- unique(df$Serial_No)
nSerial_Nos <- length(Serial_Nos)
cfs <- cis <- preds1 <- preds2 <- NULL
for (i in 1:nSerial_Nos) {
cat(Serial_Nos[i],"Loop\n")
tmp1 <- filter(df,Serial_No==Serial_Nos[i])
sv1 <- vbStarts(tl~age,data=tmp1)
fit1 <- nls(tl~vb(age,Linf,K,t0),data=tmp1,start=sv1)
cfs <- rbind(cfs,coef(fit1))
boot1 <- Boot(fit1)
tmp2 <- confint(boot1)
cis <- rbind(cis,c(tmp2["Linf",],tmp2["K",],tmp2["t0",]))
ages <- seq(-1,16,0.2)
boot2 <- Boot(fit1,f=predict2)
tmp2 <- data.frame(Serial_No=Serial_Nos[i],age=ages,
predict(fit1,data.frame(age=ages)),
confint(boot2))
preds1 <- rbind(preds1,tmp2)
tmp2 <- filter(tmp2,age>=agesum$minage[i],age<=agesum$maxage[i])
preds2 <- rbind(preds2,tmp2)
}
The code runs, but the results from the VBGF returned are the same for every serial no, which can't be the case. I think it's the filter function not working in the above code.
I've searched for solutions but can't get it to work.
If anyone can please help, or knows of a solution i'd really appreciate it.
Thank you in advance
Model fit with package growthrates
The following post describes an alternative approach without for-loop and filter. Similar loop-free solutions can be implemented using the common nls function and lapply in "base" R or group_by in "tidyverse".
Model definition
The growthrates package does not contain a von Bertalanffy function, so it has to be provided as user supplied model, as described in the package vignette. Here I borrowed the function from package FSA and adapted it accordingly:
library("growthrates")
grow_von_bert <- function(time, parms) {
with(as.list(parms), {
y <- Linf * (1 - exp(-K * (time - t0)))
as.matrix(data.frame(time = time, y = y))
})
}
Test of the model with a single example
p <- c(t0=5, Linf=10, K=.1)
time <- seq(5, 100)
plot(grow_von_bert(time, p), type="l")
Fit of a single data example
It is always a good idea to fit one or more single examples first, before doing this for all.
df1 <- subset(df, Serial_No == 315)
fit1 <- fit_growthmodel(df1$age, df1$tl,
FUN = grow_von_bert, p=c(t0=0, Linf=70, K=0.1))
summary(fit1)
Fit of all data sets
This can be done in a loop or with appropriate tidyverse functions, whipe package growthrates has such a function already built in, so all models can be fitted with a single function call. It is of course necessary to specify good start parameters, either the same for all curves or individual parameter sets, depending on the quality of the data. Here is the complete code including the data of the OP:
library("growthrates")
df <- data.frame(
Serial_No = factor(c(315,315,315,315,315,315,315,316,316,316,316,317,317,317,317,
317,317,317,317,317,318,318,318,318,319,319,319,319)),
year = c(1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,
1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945),
tl = c(19,33,46,55,63,66,70,22,39,55,65,20,40,51,56,60,62,63,64,65,26,43,54,60,28,
43,53,61),
age = c(1,2,3,4,5,6,7,1,2,3,4,1,2,3,4,5,6,7,8,9,1,2,3,4,1,2,3,4)
)
grow_von_bert <- function(time, parms) {
with(as.list(parms), {
y <- Linf * (1 - exp(-K * (time - t0)))
as.matrix(data.frame(time = time, y = y))
})
}
fit <- all_growthmodels(tl ~ age | Serial_No,
data=df,
FUN = grow_von_bert,
p=c(t0=0, Linf=70, K=0.1))
results(fit)
par(mfrow=c(2,3))
plot(fit, las=1)
I have a list of model summaries (let's say it is a linear model; but this could apply to any model summary).
Currently, I am extracting a certain coefficient from this list of summaries using the following:
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
How could I do this by calling the variable name instead of relying on row position?
For each of the model summaries within the list, only one variable differs, which is named as V_01, V_02, V_03 etc. This is the variable coefficient I would like to extract.
I was thinking of using the grep function, something like:
coef <- sapply(modelsummaries, function(x) x[[4]][[grep("^[V]"),1]])
...but haven't got it working. Any suggestions?
Here's a reproducible example (only the last line needs to be tweaked):
newdata <- as.data.frame(seq(from = 0.1, to = 0.9, by = 0.1))
newdata <- as.data.frame(t(newdata))
colnames(newdata) = newdata[1, ]
colnames(newdata) <- paste("V", colnames(newdata), sep = "_")
mtcars <- mtcars
mtcarsmodel <- data.frame(mtcars, newdata)
mtcarsmodel[c(12:20)] <- sample(1:100, 288, replace=TRUE)
xnam <- paste(colnames(mtcarsmodel)[c(4:5)], sep="")
xnam2 <- paste(colnames(mtcarsmodel)[c(12:20)], sep="")
fmla <- paste(xnam, collapse= "+")
fmla2 <- paste(paste(fmla), "+")
fla <- paste("mpg ~", paste(fmla2))
models <- lapply(setNames(xnam2, xnam2), function(var) {
form = paste(fla, var)
lm(form, data=mtcarsmodel)
})
modelsummaries <-lapply(models, summary)
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
You were quite close, you just needed to tell grep what to search on, which is the rownames of the coefficient matrix returned by coef() (which is a better way to get them than [[4]]). Also so as not to reuse that name, I suggest saving the result in something different, like coefs.
coefs <- sapply(modelsummaries, function(x) {
coef(x)[grep("^V", rownames(coef(x))),"Estimate"]
})
V_0.1 V_0.2 V_0.3 V_0.4 V_0.5 V_0.6 V_0.7 V_0.8
0.030927774 -0.053437459 0.009335911 -0.011009187 -0.010303494 -0.001705420 -0.036297492 0.021838044
V_0.9
0.005457086
Also, check out the new broom package which can make it easier to extract certain information from models in a tidy way.
After struggling with a grep solution, I committed blasphemy and used an sql solution instead:
library('sqldf')
new <- lapply(modelsummaries, function(x) setDT(data.frame(x[[4]]), keep.rownames = TRUE)[])
values <- sapply(new, function(x) sqldf("SELECT x.estimate, x.'Pr...t..' FROM x WHERE rn like '%V_%'"))
data <- as.data.frame(t(rbind(values)))
I've also come up with a (somewhat ugly) grep based solution:
coef <- sapply(modelsummaries, function(x) as.numeric(unlist(strsplit(grep("^V_", capture.output(x), value = TRUE), "\\s+"))[[2]]))
In my dataset I have 6 variables(x1,x2,x3,x4,x5,x6), i wish to create a function that allows me to input one variable and it will do the formula with the rest of the variables in the data set.
For instance,
fitRegression <- function(data, dependentVariable) {
fit = lm(formula = x1 ~., data = data1)
return(fit)
}
fitRegression(x2)
However, this function only returns me with results of x1. My desire result will be inputting whatever variables and will automatically do the formula with the rest of the variables.
For Example:
fitRegression(x2)
should subtract x2 from the variable list therefore we only compare x2 with x1,x3,x4,x5,x6.
and if:
fitRegression(x3)
should subtract x3 from the comparable list, therefore we compare x3 with x1,x2,x4,x5,x6.
Is there any ways to express this into my function, or even a better function.
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500),100,5))
colnames(sampleData) <- c("A","B","C","D","E")
# function
fitRegression <- function(mydata, dependentVariable) {
# select your independent and dependent variables
dependentVariableIndex<-which(colnames(mydata)==dependentVariable)
independentVariableIndices<-which(colnames(mydata)!=dependentVariable)
fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~", paste(colnames(mydata)[independentVariableIndices], collapse = "+"), sep = "" )), data = mydata)
return(fit)
}
# ground truth
lm(formula = A~B+C+D+E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
You want to select the Y variable in your argument. The main difficulty is to pass this argument without any quotes in your function (it is apparently the expected result in your code). Therefore you can use this method, using the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
return(lm(formula, data) )
}
fitRegression(mtcars, disp)
That will return the model.
The below function uses "purrr" and "caret" it produces a list of models.
df <-mtcars
library(purrr);library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p=0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
map(function(col) train(y= df[vect,col], x= df[vect,-col], method="lm"))
I want to build post-hoc testing for log-rank test, that compare individual groups to each other:
library(survival)
survdiff(DV ~ IV, data=mydf)
Is there any way to do something like this (e.g. to compare group 2 with group 7):
survdiff(DV ~ I(if(as.numeric(IV) == 2) {1} else {if(as.numeric(IV) == 7) {2} else {NA}} ), data=mydf )
I know that I can filter out not needed rows from mydf and the Surv object DV
df2vs7<-mydf[as.numeric(mydf$IV)==2 | as.numeric(mydf$IV)==7,]
DV2vs7<-DV[as.numeric(mydf$IV)==2 | as.numeric(mydf$IV)==7,]
But I think it is highly inefficient; the computer would need to store each separate data objects for each combination of compared groups.
Use the subset argument to survdiff inside a (nested) loop. You might have to tweak the loops to avoid off-by-one errors, but you get the idea.
l <- list()
n <- <no. of groups>
mydf$IV <- as.numeric(mydf$IV)
for(i in seq_len(n - 1))
{
for(j in seq(i + 1, to=n, by=1))
{
l <- c(l, survdiff(DV ~ factor(IV), mydf, subset=IV %in% c(i, j)))
}
}
OK, this is a copy&paste solution. Suppose we have df a dataframe object, where we have IV categorical variable with length(levels(df$IV)) levels, and the DF object of type Surv and we want to perform pairwise log-rank tests between each pair of groups
library(gregmisc)
levels<-sort(unique(as.numeric((mydf$IV))))
groups<-combinations(length(levels),2,levels)
#or if we assume that levels produced by as.numeric(mydf$IV) are in sequence 1:n, we can use more efficient:
#groups<-combinations(length(levels(df$IV)),2)
library(plyr)
alply(groups, 1,
function(pair) {
survdiff(DV ~ factor(IV), mydf, subset=IV %in% c(pair[[1]], pair[[2]]))
}
)
The last expression returns the results
Something like this?
survdiff(DV ~ IV, data=mydf[ as.numeric(mydf$IV) %in% c(2,7),] )