Create numerous statistical models based on selection criteria in a separate dataframe - r

I want to fit a number of statistical models based on selection criteria specified in a data frame. Using a basic example, say I had 2 response variables and 2 explanatory variables:
#######################Data Input############################
Responses <- as.data.frame(matrix(sample(0:10, 1*100, replace=TRUE), ncol=2))
colnames(Responses) <- c("A","B")
Explanatories <- as.data.frame(matrix(sample(20:30, 1*100, replace=TRUE), ncol=2))
colnames(Explanatories) <- c("x","y")
I then define which statistical models I would like to run, which can include different combinations of response / explanatory variables and different statistical functions:
###################Model selection#########################
Function <- c("LIN","LOG","EXP") ##Linear, Logarithmic (base 10) and exponential - see the formula for these below
Respo <- c("A","B","B")
Explan <- c("x","x","y")
Model_selection <- data.frame(Function,Respo,Explan)
How do I then build a list of models based on these selection criteria? Here is an example of the models I would like to create from the inputs in the Model_selection data frame.
####################Model creation#########################
Models <- list(
  lm(Responses$A ~ Explanatories$x),
  lm(Responses$B ~ log10(Explanatories$x)),
  lm(Responses$B ~ exp(Explanatories$y))
)
I would guess that some kind of loop would be required, and after looking around, perhaps paste too? Thanks in advance for any help with this.
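A minimal sketch of that paste/formula idea (assuming the objects defined above; the fun_map lookup is introduced here only for illustration):
my_data <- cbind(Responses, Explanatories)
fun_map <- c(LIN = "identity", LOG = "log10", EXP = "exp")
Models <- lapply(seq_len(nrow(Model_selection)), function(i) {
  #build e.g. "A ~ identity(x)" as text, then convert it to a formula and fit
  f <- as.formula(paste0(Model_selection$Respo[i], " ~ ",
                         fun_map[[as.character(Model_selection$Function[i])]],
                         "(", Model_selection$Explan[i], ")"))
  lm(f, data = my_data)
})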

This isn't the prettiest solution, but it seems to work for your example:
Models <- list()
idx <- 1L
for (row in 1:nrow(Model_selection)){
  if (Model_selection$Function[row]=='LOG'){
    expl <- paste0('LOG', Model_selection$Explan[row])
    Explanatories[[expl]] <- log10(Explanatories[[Model_selection$Explan[row]]])
    Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
  }
  if (Model_selection$Function[row]=='EXP'){
    expl <- paste0('EXP', Model_selection$Explan[row])
    Explanatories[[expl]] <- exp(Explanatories[[Model_selection$Explan[row]]])
    Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
  }
  if (Model_selection$Function[row]=='LIN'){
    expl <- paste0('LIN', Model_selection$Explan[row])
    Explanatories[[expl]] <- Explanatories[[Model_selection$Explan[row]]]
    Models[[idx]] <- lm(Responses[[Model_selection$Respo[row]]] ~ Explanatories[[expl]])
  }
  names(Models)[idx] <- paste(Model_selection$Respo[row], '~', expl)
  idx <- idx+1L
}
Models
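Because each element is named inside the loop, a single fit can be pulled out afterwards, e.g. (with the example Model_selection above):
summary(Models[["A ~ LINx"]])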

This is a perfect use case for the tidyverse.
library(tidyverse)
## cbind both data sets into one
my_data <- cbind(Responses, Explanatories)
## use mutate() to map the Function codes to actual R function names
## mutate_all() to transform implicit factors to characters
## NB this step could be omitted if Function already used the proper names
model_params <- Model_selection %>%
  mutate(Function = case_when(Function == "LIN" ~ "identity",
                              Function == "LOG" ~ "log10",
                              Function == "EXP" ~ "exp")) %>%
  mutate_all(as.character)
## create a function which estimates the model given the parameters
## NB: the function's parameters must be named exactly like the columns
## in the Model_selection df
make_model <- function(Function, Respo, Explan) {
  my_formula <- formula(paste0(Respo, "~", Function, "(", Explan, ")"))
  my_mod <- lm(my_formula, data = my_data)
  ## syntactic sugar: so that the formula is shown when the model is printed
  my_mod$call$formula <- my_formula
  my_mod
}
## use purrr::pmap to loop over the model params
## creates a list with all the models
pmap(model_params, make_model)
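If you want the list elements named after their formulas (handy for inspecting the result), you could capture the output and name it, e.g.:
models <- pmap(model_params, make_model)
names(models) <- with(model_params, paste0(Respo, " ~ ", Function, "(", Explan, ")"))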

Related

Expand for-loop to accommodate list in R?

I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question on Stack Overflow which is basically what I want, and I was able to run it well with gam() instead of gamm(); however, I want to expand it to include multiple independent variables in the model, not just one.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, testing combinations of anywhere from 1 independent variable up to all of the possible covariates in "d_pred".
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
                           function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[, y]])
## create a "matrix" list of dimensions i x j
results_m2 <- lapply(1:length(df_combinations2), matrix, data = NA, nrow = ncol(d_resp), ncol = 2)
## for-loop
for(k in 1:length(df_combinations2)){
  for(i in 1:ncol(d_resp)){
    for(j in 1:ncol(df_combinations2[[k]])){
      results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][, 1]) + s(df_combinations2[[k]][, 2]))
    }
  }
}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Does anyone know why I am getting this error and how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(mgcv)       # for gamSim() and gam()
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
  #create combination
  combination <- combn(predictors, m = x) |> as.data.table()
  #add s() to all for gam
  combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
  #collapse
  combination <- summarize_all(combination, .funs = paste0, collapse = "+")
  #unlist
  combination <- unlist(combination)
  #remove names
  names(combination) <- NULL
  #return
  return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictor combinations)
if(!file.exists("dev/hypergrid.csv")){
  #create hypergrid and save to dev
  hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
  #add identifier
  hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
  #save to dev
  fwrite(hypergrid, file = "dev/hypergrid.csv")
} else {
  #if the file exists, read it
  hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progress bar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
  #update progress bar
  setTxtProgressBar(pb, i)
  #select target
  target <- hypergrid[i,]$target
  #select predictors
  predictors <- hypergrid[i,]$predictors
  #create formula
  gam.formula <- as.formula(paste0(target, "~", predictors))
  #run gam
  gam.model <- gam(gam.formula, data = d)
  #save gam model to dev/models
  saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
  #read the right model
  rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
  #extract model performance, add to hypergrid
  hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and R2
hypergrid <- dplyr::arrange(hypergrid, target, desc(R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind with this approach, and I would be careful about conducting your analysis like this.
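Since the question also mentions ranking by AIC and deviance explained, those can be collected the same way. A sketch, assuming the models were saved to dev/models as above:
for(i in 1:nrow(hypergrid)){
  #read the saved model
  rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
  #AIC() works on gam objects; dev.expl is a component of summary.gam
  hypergrid[i, AIC := AIC(rel.model)]
  hypergrid[i, dev.expl := summary(rel.model)[["dev.expl"]]]
}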

Solution for filter() not working within a For Loop?

I'm trying to fit a von Bertalanffy growth function (VBGF) in R to my data, grouped by a serial number.
This is a snippet of my data:
Serial_No<- c(315,315,315,315,315,315,315,316,316,316,316,317,317,317,317,317,317,317,317,317,318,318,318,318,319,319,319,319)
Year<-c(1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945)
tl<-c(19,33,46,55,63,66,70,22,39,55,65,20,40,51,56,60,62,63,64,65,26,43,54,60,28,43,53,61)
age<-c(1,2,3,4,5,6,7,1,2,3,4,1,2,3,4,5,6,7,8,9,1,2,3,4,1,2,3,4)
df<-data.frame(Serial_No, Year, tl, age)
I've been following this example: https://www.r-bloggers.com/2020/01/von-bertalanffy-growth-plots-ii/
and have changed my code as follows (vbFuns() and vbStarts() come from the FSA package, Boot() from car, and filter()/group_by()/summarize() from dplyr):
library(FSA)
library(dplyr)
library(car)
vb <- vbFuns()
predict2 <- function(x) predict(x, data.frame(age = ages))
agesum <- group_by(df, Serial_No) %>%
  summarize(minage = min(age), maxage = max(age))
Serial_Nos <- unique(df$Serial_No)
nSerial_Nos <- length(Serial_Nos)
cfs <- cis <- preds1 <- preds2 <- NULL
for (i in 1:nSerial_Nos) {
  cat(Serial_Nos[i], "Loop\n")
  tmp1 <- filter(df, Serial_No == Serial_Nos[i])
  sv1 <- vbStarts(tl ~ age, data = tmp1)
  fit1 <- nls(tl ~ vb(age, Linf, K, t0), data = tmp1, start = sv1)
  cfs <- rbind(cfs, coef(fit1))
  boot1 <- Boot(fit1)
  tmp2 <- confint(boot1)
  cis <- rbind(cis, c(tmp2["Linf", ], tmp2["K", ], tmp2["t0", ]))
  ages <- seq(-1, 16, 0.2)
  boot2 <- Boot(fit1, f = predict2)
  tmp2 <- data.frame(Serial_No = Serial_Nos[i], age = ages,
                     predict(fit1, data.frame(age = ages)),
                     confint(boot2))
  preds1 <- rbind(preds1, tmp2)
  tmp2 <- filter(tmp2, age >= agesum$minage[i], age <= agesum$maxage[i])
  preds2 <- rbind(preds2, tmp2)
}
The code runs, but the VBGF results returned are the same for every Serial_No, which can't be the case. I think it's the filter() function not working in the above code.
I've searched for solutions but can't get it to work.
If anyone can help, or knows of a solution, I'd really appreciate it.
Thank you in advance.
Model fit with package growthrates
The following describes an alternative approach without a for-loop and filter. Similar loop-free solutions can be implemented with the common nls() function and lapply() in "base" R, or with group_by() in the "tidyverse"; a sketch of the nls()/lapply() variant appears at the end of this answer.
Model definition
The growthrates package does not contain a von Bertalanffy function, so it has to be provided as a user-supplied model, as described in the package vignette. Here I borrowed the function from the FSA package and adapted it accordingly:
library("growthrates")
grow_von_bert <- function(time, parms) {
  with(as.list(parms), {
    y <- Linf * (1 - exp(-K * (time - t0)))
    as.matrix(data.frame(time = time, y = y))
  })
}
Test of the model with a single example
p <- c(t0=5, Linf=10, K=.1)
time <- seq(5, 100)
plot(grow_von_bert(time, p), type="l")
Fit of a single data example
It is always a good idea to fit one or more single examples first, before doing this for all.
df1 <- subset(df, Serial_No == 315)
fit1 <- fit_growthmodel(df1$age, df1$tl,
FUN = grow_von_bert, p=c(t0=0, Linf=70, K=0.1))
summary(fit1)
Fit of all data sets
This can be done in a loop or with appropriate tidyverse functions, while the growthrates package has such a function already built in, so all models can be fitted with a single function call. It is of course necessary to specify good start parameters, either the same for all curves or individual parameter sets, depending on the quality of the data. Here is the complete code, including the data from the OP:
library("growthrates")
df <- data.frame(
Serial_No = factor(c(315,315,315,315,315,315,315,316,316,316,316,317,317,317,317,
317,317,317,317,317,318,318,318,318,319,319,319,319)),
year = c(1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,
1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945,1945),
tl = c(19,33,46,55,63,66,70,22,39,55,65,20,40,51,56,60,62,63,64,65,26,43,54,60,28,
43,53,61),
age = c(1,2,3,4,5,6,7,1,2,3,4,1,2,3,4,5,6,7,8,9,1,2,3,4,1,2,3,4)
)
grow_von_bert <- function(time, parms) {
  with(as.list(parms), {
    y <- Linf * (1 - exp(-K * (time - t0)))
    as.matrix(data.frame(time = time, y = y))
  })
}
fit <- all_growthmodels(tl ~ age | Serial_No,
                        data = df,
                        FUN = grow_von_bert,
                        p = c(t0 = 0, Linf = 70, K = 0.1))
results(fit)
par(mfrow=c(2,3))
plot(fit, las=1)
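As mentioned above, a loop-free base-R alternative is possible with nls() and lapply(). A minimal sketch, assuming the df and the start values used in this answer:
## split by group, fit nls() per group, then collect the coefficients
fits <- lapply(split(df, df$Serial_No), function(d) {
  nls(tl ~ Linf * (1 - exp(-K * (age - t0))),
      data = d, start = c(t0 = 0, Linf = 70, K = 0.1))
})
lapply(fits, coef)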

R: Extracting values from list of model summaries using pattern matching

I have a list of model summaries (let's say it is a linear model; but this could apply to any model summary).
Currently, I am extracting a certain coefficient from this list of summaries using the following:
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
How could I do this by calling the variable name instead of relying on row position?
For each of the model summaries within the list, only one variable differs, which is named as V_01, V_02, V_03 etc. This is the variable coefficient I would like to extract.
I was thinking of using the grep function, something like:
coef <- sapply(modelsummaries, function(x) x[[4]][[grep("^[V]"),1]])
...but haven't got it working. Any suggestions?
Here's a reproducible example (only the last line needs to be tweaked):
newdata <- as.data.frame(seq(from = 0.1, to = 0.9, by = 0.1))
newdata <- as.data.frame(t(newdata))
colnames(newdata) = newdata[1, ]
colnames(newdata) <- paste("V", colnames(newdata), sep = "_")
mtcars <- mtcars
mtcarsmodel <- data.frame(mtcars, newdata)
mtcarsmodel[c(12:20)] <- sample(1:100, 288, replace=TRUE)
xnam <- paste(colnames(mtcarsmodel)[c(4:5)], sep="")
xnam2 <- paste(colnames(mtcarsmodel)[c(12:20)], sep="")
fmla <- paste(xnam, collapse= "+")
fmla2 <- paste(paste(fmla), "+")
fla <- paste("mpg ~", paste(fmla2))
models <- lapply(setNames(xnam2, xnam2), function(var) {
  form = paste(fla, var)
  lm(form, data = mtcarsmodel)
})
modelsummaries <- lapply(models, summary)
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
You were quite close; you just needed to tell grep() what to search in, which is the rownames of the coefficient matrix returned by coef() (a better way to get it than [[4]]). Also, so as not to reuse that name, I suggest saving the result in something different, like coefs.
coefs <- sapply(modelsummaries, function(x) {
  coef(x)[grep("^V", rownames(coef(x))), "Estimate"]
})
V_0.1 V_0.2 V_0.3 V_0.4 V_0.5 V_0.6 V_0.7 V_0.8
0.030927774 -0.053437459 0.009335911 -0.011009187 -0.010303494 -0.001705420 -0.036297492 0.021838044
V_0.9
0.005457086
Also, check out the broom package, which can make it easier to extract certain information from models in a tidy way.
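For instance, a sketch of the broom route, working on the models list from the question rather than on the summaries (term and estimate are the column names broom::tidy() returns for lm fits):
library(broom)
library(purrr)
#one row per coefficient, with a "model" column identifying the list element
tidied <- map_dfr(models, tidy, .id = "model")
coefs <- tidied[grepl("^V_", tidied$term), c("model", "term", "estimate")]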
After struggling with a grep solution, I committed blasphemy and used an SQL solution instead:
library('sqldf')
library('data.table')   # needed for setDT()
new <- lapply(modelsummaries, function(x) setDT(data.frame(x[[4]]), keep.rownames = TRUE)[])
values <- sapply(new, function(x) sqldf("SELECT x.estimate, x.'Pr...t..' FROM x WHERE rn like '%V_%'"))
data <- as.data.frame(t(rbind(values)))
I've also come up with a (somewhat ugly) grep based solution:
coef <- sapply(modelsummaries, function(x) as.numeric(unlist(strsplit(grep("^V_", capture.output(x), value = TRUE), "\\s+"))[[2]]))

Function to regress chosen variable against all others

In my dataset I have 6 variables (x1, x2, x3, x4, x5, x6). I wish to create a function that allows me to input one variable, and it will build the formula with the rest of the variables in the data set.
For instance,
fitRegression <- function(data, dependentVariable) {
  fit = lm(formula = x1 ~ ., data = data1)
  return(fit)
}
fitRegression(x2)
However, this function only returns results for x1. My desired result is to input any variable and have the formula built automatically with the rest of the variables.
For example:
fitRegression(x2)
should remove x2 from the variable list, so we only compare x2 with x1, x3, x4, x5, x6;
and:
fitRegression(x3)
should remove x3 from the list, so we compare x3 with x1, x2, x4, x5, x6.
Is there any way to express this in my function, or is there a better approach?
You can do it like this:
# sample data
sampleData <- data.frame(matrix(rnorm(500), 100, 5))
colnames(sampleData) <- c("A", "B", "C", "D", "E")
# function
fitRegression <- function(mydata, dependentVariable) {
  # select your independent and dependent variables
  dependentVariableIndex <- which(colnames(mydata) == dependentVariable)
  independentVariableIndices <- which(colnames(mydata) != dependentVariable)
  fit = lm(formula = as.formula(paste(colnames(mydata)[dependentVariableIndex], "~",
                                      paste(colnames(mydata)[independentVariableIndices], collapse = "+"),
                                      sep = "")),
           data = mydata)
  return(fit)
}
# ground truth
lm(formula = A ~ B + C + D + E, data = sampleData)
# reconcile results
fitRegression(sampleData, "A")
You want to select the Y variable through your argument. The main difficulty is passing this argument without quotes to your function (which is apparently what your code expects). You can do this with the combination deparse(substitute(...)):
fitRegression <- function(data, dependentVariable) {
  formula <- as.formula(paste0(deparse(substitute(dependentVariable)), "~."))
  return(lm(formula, data))
}
fitRegression(mtcars, disp)
That will return the model.
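If you would rather pass the variable name as a string, reformulate() gives a similar result (a sketch, not part of the original answer; fitRegressionChr is just an illustrative name):
fitRegressionChr <- function(data, dependentVariable) {
  #reformulate(".", response = ...) builds e.g. disp ~ .
  lm(reformulate(".", response = dependentVariable), data = data)
}
fitRegressionChr(mtcars, "disp")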
The function below uses "purrr" and "caret"; it produces a list of models.
df <- mtcars
library(purrr); library(caret)
#create training set
vect <- createDataPartition(1:nrow(df), p = 0.8, list = FALSE)
#build model list
ModList <- 1:length(df) %>%
  map(function(col) train(y = df[vect, col], x = df[vect, -col], method = "lm"))

R formula: how to constrain calculations to two groups using formula?

I want to build post-hoc tests for the log-rank test that compare individual groups to each other:
library(survival)
survdiff(DV ~ IV, data=mydf)
Is there any way to do something like this (e.g. to compare group 2 with group 7):
survdiff(DV ~ I(if(as.numeric(IV) == 2) {1} else {if(as.numeric(IV) == 7) {2} else {NA}} ), data=mydf )
I know that I can filter out the unneeded rows from mydf and from the Surv object DV:
df2vs7<-mydf[as.numeric(mydf$IV)==2 | as.numeric(mydf$IV)==7,]
DV2vs7<-DV[as.numeric(mydf$IV)==2 | as.numeric(mydf$IV)==7,]
But I think this is highly inefficient; the computer would need to store separate data objects for each combination of compared groups.
Use the subset argument to survdiff inside a (nested) loop. You might have to tweak the loops to avoid off-by-one errors, but you get the idea.
l <- list()
n <- <no. of groups>
mydf$IV <- as.numeric(mydf$IV)
for(i in seq_len(n - 1))
{
  for(j in seq(i + 1, to = n, by = 1))
  {
    l <- c(l, list(survdiff(DV ~ factor(IV), mydf, subset = IV %in% c(i, j))))
  }
}
OK, this is a copy&paste solution. Suppose we have df a dataframe object, where we have IV categorical variable with length(levels(df$IV)) levels, and the DF object of type Surv and we want to perform pairwise log-rank tests between each pair of groups
library(gtools)   # for combinations(); gtools superseded the old gregmisc bundle
levels <- sort(unique(as.numeric(mydf$IV)))
groups <- combinations(length(levels), 2, levels)
#or, if we assume the levels produced by as.numeric(mydf$IV) are the sequence 1:n, the more efficient:
#groups <- combinations(length(levels(mydf$IV)), 2)
library(plyr)
alply(groups, 1,
      function(pair) {
        survdiff(DV ~ factor(IV), mydf, subset = IV %in% c(pair[[1]], pair[[2]]))
      })
The last expression returns the results.
Something like this?
survdiff(DV ~ IV, data=mydf[ as.numeric(mydf$IV) %in% c(2,7),] )
