Loop mixed models and tukey comparison - r

I would like to loop a mixed model and a Tukey test.
All I want to do is repeat the fitting and the comparison for 3 columns (each containing a response) and for 4 subgroups (12 fits in total).
A similar data frame is available here: https://drive.google.com/open?id=0Bwrsa11LAnrgTXMzWk1fYXR1MHM. The 3 responses are the columns "RESP_1", "RESP_2" and "RESP_3"; the subgroups are the levels of the column "layer".
I obtain the model and the adjustment for a single response and a single layer with:
#mixed model
Mlm_RESP_1 <-lme(RESP_1~clay+till, random=~1|strata/point, data=loop_lm_tukey)
#tukey
ls_RSP_1 <- lsmeans(Mlm_RESP_1,pairwise~till,adjust="tukey")
ls_RSP_1$contrasts
cld(ls_RSP_1)
Then, I try to loop the model for each column by:
#loop model
mlm_RESP <- lapply(c("RESP_1", "RESP_2","RESP_3"), function(k) {
lme(eval(substitute(j ~ clay+till, list(j = as.name(k)))), random = ~1|strata/point, data = loop_lm_tukey)})
From here, I am not able to loop the Tukey comparison using the lsmeans package, because lapply returns a list and lsmeans can't handle an object of that class.
Furthermore, how can I loop this for every layer?
Any help with looping the Tukey comparison would be much appreciated.

Try this:
mlm_RESP <- lapply(c("RESP_1", "RESP_2","RESP_3"), function(k) {
df=cbind(resp=loop_lm_tukey[,k],loop_lm_tukey[,-c(1:3)])
lme(resp~clay+till, random = ~1|strata/point, data = df)})
res1=lapply(mlm_RESP,function(rm)lsmeans(rm,pairwise~till,adjust="tukey"))
or:
res2=list()
for (i in 1:3) res2[[i]]=lsmeans(mlm_RESP[[i]],pairwise~till,adjust="tukey")
The results are the same.
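The question about repeating this for every layer is not covered above. A minimal sketch of one way to do it is to split the data by layer and nest the loop over responses inside it. Everything in the block is an assumption for illustration: the data frame toy is simulated, and a plain one-way aov() with TukeyHSD() stands in for lme() and lsmeans() so that the example runs with base R only. The looping structure is the point, not the model.

```r
# Simulated stand-in for loop_lm_tukey (hypothetical data, 4 layers x 3 tillage levels)
set.seed(1)
toy <- data.frame(
  layer  = rep(c("L1", "L2", "L3", "L4"), each = 30),
  till   = factor(rep(c("a", "b", "c"), times = 40)),
  RESP_1 = rnorm(120),
  RESP_2 = rnorm(120),
  RESP_3 = rnorm(120)
)
responses <- c("RESP_1", "RESP_2", "RESP_3")

# Outer loop: one data subset per layer; inner loop: one fit per response
results <- lapply(split(toy, toy$layer), function(d) {
  fits <- lapply(responses, function(k) {
    # Stand-in model; replace with the lme() call and lsmeans() from the answer
    fit <- aov(reformulate("till", response = k), data = d)
    TukeyHSD(fit, "till")$till   # matrix of pairwise comparisons
  })
  setNames(fits, responses)
})

# 4 layers x 3 responses = 12 comparison tables, addressed by name
results$L1$RESP_1
```

With the real data, the body of the inner function would hold the lme() fit and the lsmeans() call, and the nested list would be indexed the same way, by layer and then by response.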

Related

Using a For Loop to Run Multiple Response Variables through a Train function to create multiple separate models in R

I am trying to create a for loop to index through each individual response variable I have and train a model using the train() function within the caret package. I have about 30 response variables and 43 predictor variables. I can train each model individually, but I would like to automate the process and have a for loop run through a model (I would like to eventually scale up to multiple models if possible, i.e. lm, rf, cubist, etc.). I then want to save each model to a data frame along with its R-squared and RMSE values. The individual models that I currently have running are as follows, with column 11 being the response variable and columns 35-68 being the predictor variables.
data_Mg <- subset(data_3, !is.na(Mg))
mg.lm <- train(Mg~., data=data_Mg[,c(11,35:68)], method="lm", trControl=control)
mg.cubist <- train(Mg~., data=data_Mg[,c(11,35:68)], method="cubist", trControl=control)
mg.rf <- train(Mg~., data=data_Mg[,c(11,35:68)], method="rf", trControl=control, na.action = na.roughfix)
max(mg.lm$results$Rsquared)
min(mg.lm$results$RMSE)
max(mg.cubist$results$Rsquared)
min(mg.cubist$results$RMSE)
max(mg.rf$results$Rsquared) #Highest R squared
min(mg.rf$results$RMSE)
This gives me 3 models with all the relevant information that I need. Now for the for loop. I've only tried the lm model so far for this.
bucket <- list()
for(i in 1:ncol(data_4)) { # for-loop response variables, need to end it at response variables, rn will run through all variables
data_y<-subset(data_4, !is.na(i))#get rid of NA's in the "i" column
predictors_i <- colnames(data_4)[i] # Create vector of predictor names
predictors_1.1 <- noquote(predictors_i)
i.lm <- train(predictors_1.1~., data=data_4[,c(i,35:68)], method="lm", trControl=control)
bucket <- i.lm
#mod_summaries[[i - 1]] <- summary(lm(y ~ ., data_y[ , c("i.lm", predictors_i)]))
#data_y <- data_4
}
Below is the error code that I am getting, with Bulk_Densi being the first variable in predictors_1.1. The error is that variable lengths differ, so I originally thought that my issue was that quotes were being added around "Bulk_Densi", but after trying the noquote() function I have not gotten anywhere, so I am unsure where I am going wrong.
Error code that I am getting
Please let me know if I can provide any extra info and thanks in advance for the help! I've already tried the info in How to train several models within a loop for and was struggling with that as well.
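Two things in the loop above stand out: !is.na(i) tests the loop index rather than the column, and predictors_1.1 ~ . is read as a formula on a variable literally named predictors_1.1, which is where "variable lengths differ" comes from. A minimal sketch of the usual fix is to build the formula from the column name. The block below is an illustration under assumptions: it uses the built-in mtcars data and plain lm() in place of train() so that it runs without caret, and the names responses, predictors and fit_one are made up for the example.

```r
# Hypothetical response and predictor columns, taken from mtcars for illustration
responses  <- c("mpg", "qsec")
predictors <- c("wt", "hp", "disp")

fit_one <- function(resp) {
  # Drop rows where this particular response is missing
  d <- mtcars[!is.na(mtcars[[resp]]), c(resp, predictors)]
  # Build e.g. mpg ~ wt + hp + disp from character strings
  f <- reformulate(predictors, response = resp)
  fit <- lm(f, data = d)   # stand-in for train(f, data = d, method = "lm", ...)
  data.frame(
    response  = resp,
    r_squared = summary(fit)$r.squared,
    rmse      = sqrt(mean(residuals(fit)^2))
  )
}

# One row of metrics per response variable
metrics <- do.call(rbind, lapply(responses, fit_one))
metrics
```

If the models themselves are needed, returning a list with both the fit and the metrics from fit_one, and storing it as bucket[[resp]], avoids overwriting the bucket on each pass.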

Insert multiple variables in the lda function from a list R

I have 6 variables and I want to test which combination of them is best for a linear discriminant analysis (lda).
I created a list with all the combinations.
I would like to loop through this list and run a lda for each combination
The lda formula wants column names to be specified with a + as follow:
lda(classification~ variable1+variable2, data=mydata)
However if I insert the value of my list in the lda function I get an error
unlist(mylist[i])
"variable1" "variable2"
Error in model.frame.default(formula = mylist ~ unlist(mylist[i]), :
variable lengths differ
reproducible example (variables are constant for illustrative purpose)
classification<-c("a","b","c","d","e","f")
variable1<-c(1,1,1,1,1,1)
variable2<-c(1,1,1,1,1,1)
variable3<-c(1,1,1,1,1,1)
variable4<-c(1,1,1,1,1,1)
variable5<-c(1,1,1,1,1,1)
variable6<-c(1,1,1,1,1,1)
mydata<-data.frame(classification,variable1,variable2,variable3,variable4,variable5,variable6)
para_combo1<-combn(mydata[2:7],1, simplify = FALSE)
para_combo2<-combn(mydata[2:7],2, simplify = FALSE)
para_combo3<-combn(mydata[2:7],3, simplify = FALSE)
para_combo4<-combn(mydata[2:7],4, simplify = FALSE)
para_combo5<-combn(mydata[2:7],5, simplify = FALSE)
para_combo6<-combn(mydata[2:7],6, simplify = FALSE)
para_combo<-c(para_combo1,para_combo2, para_combo3,
para_combo4,para_combo5, para_combo6)
#manual example
lda_table<-lda(classification~ variable1+variable2, data= mydata)
#example I would loop
lda_table<-lda(classification~ para_combo[7] , data= mydata)
I do not know how I could code my combination in the format lda requires
Apart from providing a formula, you can alternatively provide the features and the classes in the parameters x and grouping, respectively:
lda.result <- lda(x=mydata[,c(2,3)], grouping=mydata$classification)
# or simply:
lda.result <- lda(mydata[,c(2,3)], mydata$classification)
Note that the function lda in R actually does not only work with two variables, but with an arbitrary number of variables (sometimes called "multiple discriminant analysis"). There is thus no need to try out all pairs of variable combinations, but you can let lda figure it out for itself.
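If the formula interface is still preferred for the loop, one option is to take combinations of the column names rather than of the columns and turn each into a formula. A minimal sketch, assuming the variable names from the question; the names vars, combos and formulas are made up here, and the lda() call itself is left as a comment so that the block runs without MASS:

```r
# Column names of the candidate predictors (as in the question)
vars <- paste0("variable", 1:6)

# All combinations of 1 to 6 names, flattened into a single list
combos <- unlist(
  lapply(seq_along(vars), function(m) combn(vars, m, simplify = FALSE)),
  recursive = FALSE
)

# One formula per combination, e.g. classification ~ variable1 + variable2
formulas <- lapply(combos, reformulate, response = "classification")

length(formulas)   # 2^6 - 1 = 63 combinations
formulas[[7]]      # first two-variable combination

# Each formula could then be passed on, for example:
# fits <- lapply(formulas, function(f) MASS::lda(f, data = mydata))
```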

How to use expand.grid values to run various model hyperparameter combinations for ranger in R

I've seen various posts on how to select the independent variables for a model by using expand.grid and then create a formula based on that selection. However, I prepare my input tables beforehand and store them in a list.
library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris) # let's assume these are different input tables
I'm rather interested in trying all the possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) for my list of input tables. I do the following to set up the grid:
hyper_grid <- expand.grid(
Input_table = names(Input_list),
Trees = c(10, 20),
Importance = c("none", "impurity"),
Classification = TRUE,
Repeats = 1:5,
Target = "Species")
> head(hyper_grid)
Input_table Trees Importance Classification Repeats Target
1 iris1 10 none TRUE 1 Species
2 iris2 10 none TRUE 1 Species
3 iris1 20 none TRUE 1 Species
4 iris2 20 none TRUE 1 Species
5 iris1 10 impurity TRUE 1 Species
6 iris2 10 impurity TRUE 1 Species
My question is, what is the best way to pass these values to the model? Currently I'm using a for loop:
for (i in 1:nrow(hyper_grid)) {
RF_train <- ranger(
dependent.variable.name = hyper_grid[i, "Target"],
data = Input_list[[hyper_grid[i, "Input_table"]]], # referring to the named object in the list
num.trees = hyper_grid[i, "Trees"],
importance = hyper_grid[i, "Importance"],
classification = hyper_grid[i, "Classification"]) # otherwise regression is performed
print(RF_train)
}
iterating over each row of the grid. But for one, I now have to tell the model whether it is classification or regression. I assume the factor Species is converted to numeric factor levels, so regression occurs by default. Is there a way to prevent this, and also to use e.g. apply for this task? This way of iterating also results in messy function calls:
Call:
ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i, "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i, "Importance"], classification = hyper_grid[i, "Classification"])
Second: in reality, the output of the model is obviously not printed; instead I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? If I store the ranger objects instead, I run into memory issues at some point.
Thank you!
I think it is cleanest to wrap the training and extraction of the values you need into a function. The dots (...) are needed for usage with the purrr::pmap function below.
fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
RF_train <- ranger(
dependent.variable.name = Target,
data = Input_list[[Input_table]], # referring to the named object in the list
num.trees = Trees,
importance = Importance,
classification = Classification) # otherwise regression is performed
data.frame(Prediction_error = RF_train$prediction.error,
True_positive = RF_train$confusion.matrix[1])
}
Then you can add the results as a column by mapping over the rows using for example purrr::pmap:
hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)
By mapping in this way, the function is applied row by row, so you should not run into memory issues.
The result of purrr::pmap is a list, which means that the column res contains a list for every row. This can be unnested using tidyr::unnest to spread the elements of that list across your data frame.
tidyr::unnest(hyper_grid, res)
I think this approach is very elegant, but it requires some tidyverse knowledge. I highly recommend this book if you want to know more about that. Chapter 25 (Many models) describes an approach similar to the one I'm taking here.
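For readers without the tidyverse, the same row-by-row mapping can be sketched in base R with Map(). This is an illustration under assumptions: summarise_row is a made-up stand-in for the ranger call so that the block runs without ranger, and the second input table is truncated only to make the two inputs distinguishable. Passing stringsAsFactors = FALSE to expand.grid keeps the character columns as strings, so Input_table indexes the list by name.

```r
# Two hypothetical input tables
Input_list <- list(iris1 = iris, iris2 = iris[1:100, ])

# Keep strings as strings so they can be used as list names
grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  stringsAsFactors = FALSE
)

# Stand-in for the model fit: returns one row of "metrics" per grid row
summarise_row <- function(Input_table, Trees, ...) {
  d <- Input_list[[Input_table]]
  data.frame(n_rows = nrow(d), trees_used = Trees)
}

# Map() calls the function once per grid row, matching columns to arguments by name
res <- do.call(Map, c(f = summarise_row, grid))

# Bind the per-row results back onto the grid
out <- cbind(grid, do.call(rbind, unname(res)))
out
```

Only the small data frame returned by the function is kept for each row, so the fitted object can be garbage-collected after each call.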

Loop through a list of variables to add to a base survival model then keep the key output in a table

Two-part question:
Firstly, I have a list of n variables in a data frame that I want to sequentially substitute into a survival model (thus creating n new models), and from the output of each, I want to retain only the summary table line (HR, SE's etc) related to that variable (so an n-row table).
#create list of variables from dataset
bloods <- colnames(data)[c(123,127, 129:132, 135:140, 143:144, 190:195)]
then loop through creating a new model each time. The following doesn't work but not sure why...
for (i in 1:length(bloods)){
x <- coxph(Surv(time, event) ~ i + var1+var2+var3, data=data, na.action=na.omit)
}
Not sure how to select and append the first row of the summary table (summary(x)[7]) to a table each time? I suppose I must create the table before the loop?
Any help very much appreciated!
Consider lapply on a dynamic formula build which will result in a list of summary tables:
bloods <- colnames(data)[c(123,127, 129:132, 135:140, 143:144, 190:195)]
sumtables <- lapply(bloods, function(i) {
# STRING INTERPOLATION WITH sprintf, THEN CONVERTED TO FORMULA OBJECT
iformula <- as.formula(sprintf("Surv(time, event) ~ %s + var1+var2+var3", i))
# RUN MODEL REFERENCING DYNAMIC FORMULA
x <- coxph(iformula, data=data, na.action=na.omit)
# RETURN COEFF MATRIX RESULTS
summary(x)[7][[1]]
})
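To get the single n-row table the question asks for, each model can return only the row for its own variable and the rows can be stacked afterwards. A minimal sketch of that pattern, with assumptions stated plainly: it uses the built-in mtcars data and lm() as a stand-in for coxph() so that it runs with base R only, and candidates is a made-up list of variables.

```r
# Hypothetical variables to substitute into the base model one at a time
candidates <- c("hp", "disp", "qsec")

rows <- lapply(candidates, function(v) {
  # Same dynamic-formula idea as above
  f <- as.formula(sprintf("mpg ~ %s + wt", v))
  fit <- lm(f, data = mtcars)   # stand-in for coxph(f, data = data)
  # Keep only the coefficient row belonging to the substituted variable
  coef(summary(fit))[v, , drop = FALSE]
})

# One row per candidate variable, named by that variable
coef_table <- do.call(rbind, rows)
coef_table
```

With coxph, the same indexing by variable name should apply to the coefficient matrix returned in the summary, since its rows are named after the model terms.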

apply series of commands to split data frame

I'm having some difficulty figuring out how to approach this problem. I have a data frame that I am splitting into distinct sites (link5). Once split, I basically want to run a linear regression model on each subset. Here is the code I'm working with, but it's definitely not correct. Also, it would be great if I could output the model results to a new data frame such that each site has one row with the model parameter estimates, though that is just a wish and not a necessity right now. Thank you for any help!
les_events <- split(les, les$link5)
result <- lapply(les_events) {
lm1 <-lm(cpe~K,data=les_events)
coef <- coef(lm1)
q.hat <- -coef(lm1)[2]
les_events$N0.hat <- coef(lm1[1]/q.hat)
}
You have a number of issues:
You haven't passed a function (the FUN argument) to lapply.
Your closure (the bit inside {}) is almost, but not quite, the body you want for your function.
Something like the following will return the coefficients from your models:
result <- lapply(les_events, function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
This will return a list of data.frames containing columns for each coefficient.
lapply(les_events, lm, formula = 'cpe~K')
will return a list of linear model objects, which may be more useful.
For more general split / apply / combine approaches, use plyr or data.table.
data.table
library(data.table)
DT <- data.table(les)
result <- DT[, {lm1 <- lm(cpe ~ K, data = .SD)
as.list(coef(lm1))}, by = link5]
plyr
library(plyr)
result <- ddply(les, .(link5), function(DD){
lm1 <-lm(cpe~K,data=DD)
coef <- coef(lm1)
data.frame(as.list(coef))
})
# or to return a list of linear model objects
dlply(les, .(link5), function(DD){ lm(cpe ~ K, data = DD)})
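The one-row-per-site data frame from the question can also be had in base R by binding the per-site results together. The sketch below rests on assumptions: the data are simulated, and N0.hat is computed as intercept / q.hat with q.hat = -slope, which is one reading of what the question's code was aiming for.

```r
# Simulated stand-in for les: 3 sites, declining cpe with K (hypothetical data)
set.seed(42)
les <- data.frame(
  link5 = rep(c("A", "B", "C"), each = 10),
  K     = rep(1:10, times = 3)
)
les$cpe <- 5 - 0.3 * les$K + rnorm(30, sd = 0.1)

# Fit one regression per site and return a one-row data frame each time
per_site <- lapply(split(les, les$link5), function(DD) {
  cf <- coef(lm(cpe ~ K, data = DD))
  q.hat <- -cf[[2]]
  data.frame(intercept = cf[[1]], slope = cf[[2]], N0.hat = cf[[1]] / q.hat)
})

# Stack the rows; the list names become the site column
result <- do.call(rbind, per_site)
result <- cbind(link5 = rownames(result), result)
rownames(result) <- NULL
result
```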

Resources