Combining multiple function arguments inside list2serv(lapply(),) - r

I'm working with ten training datasets, train1 through train10, and would like to repeat the following statements for 1 through 10 with a single block of code:
train_y_1 <- c(train1$y)
train1$y <-NULL
train_x_1 <- data.matrix(train1)
olsfit_1 <- cv.glmnet(y=train_y_1, x=train_x_1, alpha=1, family="gaussian")
I've read in the forums that lapply() is preferable to for loops. My code:
# Create empty data frames and list (to be populated with values in main program)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10)), envir=.GlobalEnv)
# Create y and x input matrices and run ten lasso regressions
list2env(lapply(mget(paste0('train', 1:10)), mget(paste0('train_y_', 1:10)), mget(paste0('train_x_', 1:10)), mget(paste0('lasso_', 1:10)),
function(a,b,c,d)
{
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}), envir=.GlobalEnv)
which produces the error message:
Error in match.fun(FUN) :
'mget(paste0("train_y_", 1:10))' is not a function, character or symbol
So it looks like R is confused by the four mget() functions which I intended to be reading in values for the a,b,c,d arguments, but I'm not sure how to proceed next.
Any suggestions?

You want to keep all your data in lists whenever possible, avoiding polluting the global environment with a bunch of variables. This isn't tested, and train is missing, but should be a similar list of your train data. Then, you could do something like,
trainy <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10))
trainx <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10))
lasso <- setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10))
f <- function(a,b,c,d) {
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}
mapply(f, train, trainy, trainx, lasso, SIMPLIFY=F)
Although, since your lists are just initializing variables, you probably just want to loop (apply) over a list of your training data,
lapply(train, function(x) {
... # the statements you want to repeat
list(...) # return a list of the three data.frames
})

We can achieve this with the following code.
# Load libraries
library(dplyr);library(glmnet)
# Gather all the variables in global into a list
fit = mget(paste0("train", 1:10), envir = .GlobalEnv) %>%
# Pipe each element of the list into `cv.glmnet` function
lapply(function(dat) {cv.glmnet(y = dat$y,
x = data.matrix(dat %>% mutate(y = NULL)),
alpha = 1,
family = "gaussian")})
Your output will be neatly stored in fit, which is a list with 10 elements. You can call each element with fit[[i]]. For example coef(fit[[1]]) pulls out the coefs for train1 and lapply(fit, coef) pulls the coef for all 10 models and stores them in a list.

Related

Expand for-loop to accommodate list in R?

I've recently been interested in trying to develop a for-loop that would be able to run multiple generalized additive models and then produce results in a table that ranks them based on AIC, p-value of each smooth in the model, deviance explained of the overall model, etc.
I found this related question in stack overflow which is basically what I want and was able to run this well for gam() instead of gamm(), however I want to expand this to include multiple independent variables in the model, not just 1.
Ideally, the models would run all possible combinations of independent variables against the dependent variable, and it would test combinations anywhere from 1 independent variable in the model, up to all of the possible covariates in "d_pred" in the model.
I have attempted to do this so far by starting out small and finding all possible combinations of 2 independent variables (df_combinations2), which results in a list of data frames. Then I adjusted the rest of the code to run the for loop such that each iteration will run a different combination of the two variables:
library(mgcv)
## Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
d_resp <- d[ c("y", "y1")]
d_pred <- d[, !(colnames(d) %in% c("y", "y1"))]
df_combinations2 <- lapply(1:(ncol(combn(1:ncol(d_pred), m = 2))),
function(y) d_pred[, combn(1:ncol(d_pred), m = 2)[,y]])
## create a "matrix" list of dimensions i x j
results_m2 <-lapply(1:length(df_combinations2), matrix, data= NA, nrow=ncol(d_resp), ncol=2)
## for-loop
for(k in 1:length(df_combinations2)){
for(i in 1:ncol(d_resp)){
for(j in 1:ncol(df_combinations2[[k]])){
results_m2[i, j][[1]] <- gam(d_resp[, i] ~ s(df_combinations2[[k]][,1])+s(df_combinations2[[k]][,2]))
}
}}
However, after running the for-loop I get the error "Error in all.vars1(gp$fake.formula[-2]) : can't handle [[ in formula".
Anyone know why I am getting this error/ how to fix it?
Any insight is much appreciated. Thanks!
Personally, I would create a data.table() containing all combinations of target variables and combinations of predictors and loop through all rows. See below.
library(data.table)
library(dplyr)
# Example data
set.seed(0)
dat <- gamSim(1,n=200,scale=2)
set.seed(1)
dat2 <- gamSim(1,n=200,scale=2)
names(dat2)[1:5] <- c("y1", paste0("x", 4:7))
d <- cbind(dat[, 1:5], dat2[, 1:5])
#select names of targets and predictors
targets <- c("y", "y1")
predictors <- colnames(d)[!colnames(d) %in% targets]
#create all combinations of predictors
predictor_combinations <- lapply(1:length(predictors), FUN = function(x){
#create combination
combination <- combn(predictors, m = x) |> as.data.table()
#add s() to all for gam
combination <- sapply(combination, FUN = function(y) paste0("s(", y, ")")) |> as.data.table()
#collapse
combination <- summarize_all(combination, .funs = paste0, collapse = "+")
#unlist
combination <- unlist(combination)
#remove names
names(combination) <- NULL
#return
return(combination)
})
#merge combinations of predictors as vector
predictor_combinations <- do.call(c, predictor_combinations)
#create folder to save results to
if(!dir.exists("dev")){
dir.create("dev")
}
if(!dir.exists("dev/models")){
dir.create("dev/models")
}
#create and save hypergrid (all combinations of targets and predictors combinations)
if(!file.exists("dev/hypergrid.csv")){
#create hypergrid and save to dev
hypergrid <- expand.grid(target = targets, predictors = predictor_combinations) |> as.data.table()
#add identifier
hypergrid[, model := paste0("model", 1:nrow(hypergrid))]
#save to dev
fwrite(hypergrid, file = "dev/hypergrid.csv")
} else{
#if file exists read
hypergrid <- fread("dev/hypergrid.csv")
}
#loop through hypergrid, create GAM models
#progressbar
pb <- txtProgressBar(min = 1, max = nrow(hypergrid), style = 3)
for(i in 1:nrow(hypergrid)){
#update progressbar
setTxtProgressBar(pb, i)
#select target
target <- hypergrid[i,]$target
#select predictors
predictors <- hypergrid[i,]$predictors
#create formula
gam.formula <- as.formula(paste0(target, "~", predictors))
#run gam
gam.model <- gam(gam.formula, data = d)
#save gam model do dev/model
saveRDS(gam.model, file = paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
}
#example where you extract model performances
for(i in 1:nrow(hypergrid)){
#read the right model
rel.model <- readRDS(paste0("dev/models/", hypergrid[i,]$model, ".RDS"))
#extract model performance, add to hypergrid
hypergrid[i, R2 := summary(rel.model)[["r.sq"]]]
}
#arrange hypergrid on target and r2
hypergrid <- dplyr::arrange(hypergrid, hypergrid$target, desc(hypergrid$R2))
Which would give
head(hypergrid)
target predictors model R2
1: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5) model319 0.6957242
2: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5) model423 0.6953753
3: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x7) model437 0.6942054
4: y s(x0)+s(x1)+s(x2)+s(x5) model175 0.6941025
5: y s(x0)+s(x1)+s(x2)+s(x4)+s(x5)+s(x6) model435 0.6940569
6: y s(x0)+s(x1)+s(x2)+s(x3)+s(x4)+s(x5)+s(x7) model481 0.6939756
All models are saved to a folder with an identifier (for if you want to use the model or extract more information from the model).
Notably, p-hacking comes to mind using this appraoch and I would be careful by conducting your analysis like this.

R: Extracting values from list of model summaries using pattern matching

I have a list of model summaries (let's say it is a linear model; but this could apply to any model summary).
Currently, I am extracting a certain coefficient from this list of summaries using the following:
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
How could I do this by calling the variable name instead of relying on row position?
For each of the model summaries within the list, only one variable differs, which is named as V_01, V_02, V_03 etc. This is the variable coefficient I would like to extract.
I was thinking of using the grep function, something like:
coef <- sapply(modelsummaries, function(x) x[[4]][[grep("^[V]"),1]])
...but haven't got it working. Any suggestions?
Here's a reproducible example (only the last line needs to be tweaked):
newdata <- as.data.frame(seq(from = 0.1, to = 0.9, by = 0.1))
newdata <- as.data.frame(t(newdata))
colnames(newdata) = newdata[1, ]
colnames(newdata) <- paste("V", colnames(newdata), sep = "_")
mtcars <- mtcars
mtcarsmodel <- data.frame(mtcars, newdata)
mtcarsmodel[c(12:20)] <- sample(1:100, 288, replace=TRUE)
xnam <- paste(colnames(mtcarsmodel)[c(4:5)], sep="")
xnam2 <- paste(colnames(mtcarsmodel)[c(12:20)], sep="")
fmla <- paste(xnam, collapse= "+")
fmla2 <- paste(paste(fmla), "+")
fla <- paste("mpg ~", paste(fmla2))
models <- lapply(setNames(xnam2, xnam2), function(var) {
form = paste(fla, var)
lm(form, data=mtcarsmodel)
})
modelsummaries <-lapply(models, summary)
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
You were quite close, you just needed to tell grep what to search on, which is the rownames of the coefficient matrix returned by coef() (which is a better way to get them than [[4]]). Also so as not to reuse that name, I suggest saving the result in something different, like coefs.
coefs <- sapply(modelsummaries, function(x) {
coef(x)[grep("^V", rownames(coef(x))),"Estimate"]
})
V_0.1 V_0.2 V_0.3 V_0.4 V_0.5 V_0.6 V_0.7 V_0.8
0.030927774 -0.053437459 0.009335911 -0.011009187 -0.010303494 -0.001705420 -0.036297492 0.021838044
V_0.9
0.005457086
Also, check out the new broom package which can make it easier to extract certain information from models in a tidy way.
After struggling with a grep solution, I committed blasphemy and used an sql solution instead:
library('sqldf')
new <- lapply(modelsummaries, function(x) setDT(data.frame(x[[4]]), keep.rownames = TRUE)[])
values <- sapply(new, function(x) sqldf("SELECT x.estimate, x.'Pr...t..' FROM x WHERE rn like '%V_%'"))
data <- as.data.frame(t(rbind(values)))
I've also come up with a (somewhat ugly) grep based solution:
coef <- sapply(modelsummaries, function(x) as.numeric(unlist(strsplit(grep("^V_", capture.output(x), value = TRUE), "\\s+"))[[2]]))

Loop through (subsets) using jags

I have a big dataframe with 10000 rows and 12 columns (discountdataset).
The columns contain different variables. The first 210 rows represents subject 1 (there is also a column with "subject1"), the next 210 rows represent subject 2, and so on.
I want to use jags and a loop function to loop through all 52 subjects in the dataframe, and assign a function to each of them. My code looks like this:
#subsetting the dataframe by the variable subjectid
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
Here my plan is to loop and assign the following jags function to all subjects in the subset), but, it doesn't work. I think my mistake is that the variables "nt", "Choice" that I want to pass on to jags are not defined right, or, are not updated.
library(rjags)
for (i in 1:length(subsetdiscount))
{
nt <- nrow (subsetdiscount)
Choice <- subsetdiscount$choice
amountSS <- subsetdiscount$val_basic
amountLL <- subsetdiscount$val_d
delayDIFF <- subsetdiscount$delay
con <- subsetdiscount$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(
list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt", n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
Try:
library(rjags)
library(R2jags)
subsetdiscount <- split(discountdataset, as.factor(discountdataset$subjectid))
output_models <- lapply(subsetdiscount, function(x) {
nt <- nrow(x)
Choice <- x$choice
amountSS <- x$val_basic
amountLL <- x$val_d
delayDIFF <- x$delay
con <- x$condition
data <- list("nt", "Choice", "amountSS", "amountLL", "delayDIFF", "con") # to be passed on to JAGS
myinits <- list(list(k = (c(0.01, 0.01))),
list(temp = (c(6, 6))))
parameters <- c("k", "temp")
samples <- jags(data, inits=myinits, parameters,
model.file ="singlesubmodel_Ben_roundedchoice.txt",
n.chains=2, n.iter=20000,
n.burnin=1, n.thin=1, DIC=T)
return(samples)
})
output_models should be a list containing outputs for each of the factors you split main dataset by.
Please note that it is quite hard to test this without any provided data. So, if this fails to work, you may want to provide some data for testing.
I hope it helps.

lapply on dataframe list using different column index

I'm try to do a feature selection on a dataframe list using the caret package. I have different dataframes but the last 6 columns are the same. When I am trying to apply the model on a single df it works fine
# For a single dataframe
mx.chem # the name of my single dataframe
#define the control
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
# run the rfe
mx.results <- rfe(mx.chem[,1:22], mx.chem[,23], sizes = c(1:22), rfeControl = mx.control)
print(mex.results)
but My problem is when I try to use lapply on a list of df. The code I have until now is
require(mlbench)
require(caret)
mylist # is a df list containing 3 df
for (i in 1:3) {
my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10) # define the control
longdata <- length(i)-6
idxindustry <- longdata +1
my.results <- lapply(mylist, function(x) rfe ( x[,1:longdata], x[,idxindustry], data = x, sizes =c(1:longdata), rfeControl = my.control))
}
I'm not sure if I'm using column index properly. Does anyone have an idea how to fix to make my code work. Thanks
Here are two possible ways:
#Using lapply
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
rfe.lst <- lapply(mylist,
function(x) {
longdata <- ncol(x)-6
rfe ( x[,1:longdata], x[,longdata + 1],
sizes =c(1:longdata),
rfeControl = mx.control)
})
#For loop
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
rfe.lst <- vector("list", 3)
for(i in 1:3) {
longdata <- ncol(mylist[[i]])-6
rfe.lst[[i]] <- rfe(mylist[[i]][,1:longdata], x[,longdata + 1],
sizes=c(1:longdata),
rfeControl=mx.control)
}
Your code doesn't do what you think. length(i) will always be 1, because i is your loop index and takes the values 1 to 3. You mean to do:
length(mylist[[i]])
Note the double brackets. That's how you select the element from the list, in this case the data frame. If you use single brackets, you get back a list with the elements you want.
But that's still not what you aim to achieve. If you would change that line in your code, you have 2 loops:
an outer loop that creates longdata and idxindustry based on a single data frame each time.
an inner lapply loop that uses the values for longdata and idxindustry on all three dataframes.
Remember that lapply takes each element in the list and passes it as the first argument to the function you specify. So you can do this in a single lapply like this:
my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
my.results <- lapply(mylist, function(x){
# x becomes one of the data frames in the list mylist here, so you can
# treat it like a data frame in the code below
longdata <- length(x) - 6
idxindustry <- longdata +1
rfe( x[,1:longdata], x[,idxindustry], data = x,
sizes =c(1:longdata), rfeControl = my.control)
})
And then you run rfe with longdata and idxindustry based on the data frame at hand. Note I put the call to rfeControl outside the lapply loop for performance.

For i loop, calling different dataframes

I'm new to loops and I have a problem with calling variable from i'th data frame.
I'm able to call each data frame correctly, but when I should call a specified variable inside each data frame problems come:
Example:
for (i in 1:15) {
assign(
paste("model", i, sep = ""),
(lm(response ~ variable, data = eval(parse(text = paste("data", i, sep = "")))))
)
plot(data[i]$response, predict.lm(eval(parse(text = paste("model", i, sep = ""))))) #plot obs vs preds
}
Here I'm doing a simple one variable linear model 15 times, which works just fine. Problems come when I try to plot the results. How should I call data[i] response?
Let's say there are multiple dataframes with names: data1 ...data15 and that there are no other data-objects that begin with the letters: d,a,t,a. Lets also assume that in each of those dataframes are columns named 'response' and 'variable'. The this would gather the dataframes into a list and draw separate plots for the linear regression lines.
dlist <- lapply ( ls(patt='^data'), get)
lapply(dlist, function(df)
plot(NA, xlim=range(df$variable), ylim=range(df$response)
abline( coef( lm(response ~ variable, data=df) ) )
)
If you wanted to name the dataframes in that list, you could use your paste code to supply names:
names(dlist) <- paste("data", i, sep = "")
There are many other assignments you could make in the context of this loop, but you would need to describe the desired results better than with failed efforts.
Here's modified code that should work. It does one variable lm-model and calculates correlation of predicted and observed values and stores it into an empty matrix. It also plots these values.
Thanks Thomas for help.
par(mfrow=c(4,5))
results.matrix <- matrix(NA, nrow = 20, ncol = 2)
colnames(results.matrix) <- c("Subset","Correlation")
for (i in 1:length(datalist)) {
model <- lm(response ~ variable, data = datalist[[i]])
pred <- predict.lm(model)
cor <- (cor.test(pred, datalist[[i]]$response))
plot(pred, datalist[[i]]$response, xlab="pred", ylab="obs")
results.matrix[i, 1] <- i
results.matrix[i, 2] <- cor$estimate
}

Resources