R Caret Train Input values

I want to make a list of input variables for the caret train function to ignore. So far I can do it and it works; however, it has to be done within the train call.
Example:
LabCa_R1_Fit <- train(LabCa ~ . -EV1 -kgpm -Fe ,...)
The -EV1 -kgpm -Fe is me removing the values, however, I want it in the form of:
list <- c(-EV1, -kgpm, -Fe)
LabCa_R1_Fit <- train (LabCa ~ . list, ...)
The problem is that when I put the terms to remove outside the train function, they are treated as variables instead of formula terms, and I get the corresponding error. How do I create a list of the terms I want to exclude?

There is also an undocumented feature that will allow you to use:
mod <- train(Species ~ ., data = iris, method = "lda", preProc = list(ignore = "Sepal.Width"))

I found the solution by doing the following:
# Outside
list <- LabCa ~ . -EV13
# Inside
LabCa_R1_Fit <- train( list , ... )
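If the names to exclude should live in a character vector, one option is to paste them into the formula. This is a minimal sketch, assuming the built-in mtcars data and lm as stand-ins for the asker's data and train (a formula built this way is an ordinary formula object and could be passed on in the same way). The names drop_vars and make_formula are hypothetical.

```r
# Hypothetical helper: build `response ~ . - a - b ...` from character vectors.
make_formula <- function(response, drop_vars) {
  rhs <- paste(c(".", drop_vars), collapse = " - ")
  as.formula(paste(response, "~", rhs))
}

drop_vars <- c("cyl", "disp")          # variables to exclude (assumed names)
f <- make_formula("mpg", drop_vars)    # mpg ~ . - cyl - disp

# lm is only a stand-in here; the dropped columns do not appear in the fit.
fit <- lm(f, data = mtcars)
names(coef(fit))
```

Keeping the names as strings means the vector can be built or changed programmatically before the formula is created.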

Related

Caret train function for multiple data frames as function

There was a similar question to mine more than 6 years ago and it hasn't been solved (R -- Can I apply the train function in caret to a list of data frames?)
This is why I am bringing up this topic again.
I'm writing my own functions for my big R project at the moment, and I'm wondering whether I can wrap the model training function train() of the caret package so that it works for different data frames with different predictors.
My function should look like this:
lda_ex <- function(data, predictor){
model <- train(predictor ~., data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
Using it afterwards should work like this:
data_iris <- iris
predictor_iris <- "Species"
iris_res <- lda_ex(data = data_iris, predictor = predictor_iris)
Unfortunately, as far as I have tried, the R formula cannot take a variable as input.
Is there something I am missing?
Thank you in advance for helping me out!
Solving this would help me A LOT to keep my function sheet clean and would certainly save work.
By writing predictor_iris <- "Species", you are saving a string object in predictor_iris. Thus, when you run lda_ex, I guess you get an error from the formula in train(), since the left-hand side of predictor ~ . refers to that string rather than to a column of the data.
Indeed, I tried the following toy example:
X = rnorm(1000)
Y = runif(1000)
predictor = "Y"
lm(predictor ~ X)
which gives an error about differences in the lengths of variables.
Let me modify your function:
lda_ex <- function(data, formula){
model <- train(formula, data,
method = "lda",
trControl = trainControl(method = "none"),
preProc = c("center","scale"))
return(model)
}
The key difference is that now we must pass in the whole formula, instead of the predictor only. In that way, we avoid the string-related problem.
library(caret) # Recall to specify the packages needed to reproduce your examples!
data_iris <- iris
formula_iris = Species ~ . # Key difference!
iris_res <- lda_ex(data = data_iris, formula = formula_iris)
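If passing the response name as a string is still preferred, base R's reformulate can turn it into a formula inside the function. A minimal sketch, using lm and mtcars as stand-ins so that it runs without caret. The function name fit_with_response is hypothetical, and the same formula object could be handed to train instead of lm.

```r
# Hypothetical wrapper: accept the response as a string, build `response ~ .`.
fit_with_response <- function(data, response) {
  f <- reformulate(".", response = response)  # e.g. mpg ~ .
  lm(f, data = data)                          # stand-in for train(f, data, ...)
}

f_example <- reformulate(".", response = "mpg")
fit <- fit_with_response(data = mtcars, response = "mpg")
```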

Include an object within a function only if it exists

I have a loop that needs to be executed; within which are 6 models. The objects that those models are stored in then need to get passed into a function that executes an AIC analysis. However, sometimes one of the models does not work, which then breaks the code for the AIC function because it does not recognize whatever model that failed because it was not stored as an object.
So, I need a way to pull those models that worked into the AIC function.
Here is an example, but keep in mind it is important that this can all be executed within a loop. Here are three hypothetical models:
hn.1 <- ds(data)
hn.1.obs <- ds(data,formula = ~OBSCODE)
hn.1.obs.mas <- ds(data, formula = ~OBSCODE+MAS)
And this would be my AIC function that compares the models:
summarize_ds_models(hn.1, hn.1.obs, hn.1.obs.mas)
But I get an error if say, the hn.1.obs.mas model failed.
I tried to use "get" and "ls" and I successfully pull the models that exist when I call:
get(ls(pattern='hn.15*'))
But that just returns a character vector, so that when I call:
summarize_ds_models(get(ls(pattern='hn.15*')))
it only conducts the AIC analysis on the first model in the above character vector.
Am I on the right track or is there a better way to do this?
UPDATE with a reproducible example.
Here is a simplified version of my problem:
create and fill two data frames that will be put into a list:
data.frame <- data.frame(x = integer(4),
y = integer(4),
z = integer(4),
i = integer(4))
data.frame$x <- c(1,2,3,4)
data.frame$y <- c(1,4,9,16)
data.frame$z <- c(1,3,8,10)
data.frame$i <- c(1,5,10,15)
data.frame.2 <- data.frame[1:4,1:3]
my.list <- list(data.frame,data.frame.2)
create df to fill with best models from AIC analyses
bestmodels <- data.frame(modelname = character(2))
Here is the function that will run the loop:
myfun <- function(list) {
for (i in 1:length(my.list)){
mod.1 = lm(y ~ x, data = my.list[[i]])
mod.2 = lm(y ~ x + z, data = my.list[[i]])
mod.3 = lm(y ~ i, data = my.list[[i]])
bestmodels[i,1] <- rownames(AIC(mod.1,mod.2,mod.3))[1]#bestmodel is 1st row
}
print(bestmodels)
}
However, on the second iteration of the loop, the AIC function will fail because mod.3 will fail. So, is there a generic way to make it so the AIC function will only execute for those models that worked? The outcome I would want here would be:
> bestmodels
modelname
1 mod.1
2 mod.1
since mod.1 would be chosen for both AIC analyses.
Gregor's comment:
Use a list instead of individual named objects. Then do.call(summarize_ds_models, my_list_of_models). If it isn't done already, you can Filter the list first to make sure only working models are in the list.
solved my problem. Thanks
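The list-and-Filter idea from the comment can be sketched with the reproducible example above. This is only one possible arrangement, assuming tryCatch is used to turn a failed fit into NULL. The names formulas and best_model are hypothetical. The sketch picks the lowest AIC with which.min instead of taking the first row of the AIC table.

```r
# Two data frames; the second lacks column `i`, so the third model fails on it.
df1 <- data.frame(x = c(1, 2, 3, 4), y = c(1, 4, 9, 16),
                  z = c(1, 3, 8, 10), i = c(1, 5, 10, 15))
df2 <- df1[, 1:3]
my_list <- list(df1, df2)

# Candidate models kept in a named list instead of separate objects.
formulas <- list(mod.1 = y ~ x, mod.2 = y ~ x + z, mod.3 = y ~ i)

best_model <- function(dat) {
  # Fit every candidate; a failing fit becomes NULL instead of stopping the loop.
  fits <- lapply(formulas, function(f) {
    tryCatch(lm(f, data = dat), error = function(e) NULL)
  })
  fits <- Filter(Negate(is.null), fits)   # keep only the models that worked
  aics <- sapply(fits, AIC)               # named vector of AIC values
  names(aics)[which.min(aics)]            # name of the lowest-AIC model
}

bestmodels <- data.frame(modelname = sapply(my_list, best_model))
print(bestmodels)
```

For a function that takes the models as separate arguments, such as summarize_ds_models, the filtered list can be passed with do.call as the comment suggests.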

Pass df column names to nested equation in Graph Printing Function

I need some clarification on the primary post on Passing a data.frame column name to a function
I need to create a function that will take a testSet, trainSet, and colName(aka predictor) as inputs to a function that prints a plot of the dataset with a GAM model trend line.
The issue I run into is:
plot.model = function(predictor, train, test) {
mod = gam(Response ~ s(train[[predictor]], spar = 1), data = train)
...
}
#Function Call
plot.model("Predictor1", 1.0, crime.train, crime.test)
I can't simply pass the predictor as a string into the gam function, but I also can't use a string to index the data frame values as shown in the link above. Somehow, I need to pass the colName key to the gam function. This issue occurs in other similar scenarios regarding plotting.
plot <- ggplot(data = test, mapping = aes(x=predictor, y=ViolentCrimesPerPop))
Again, I can't pass a string value for the column name and I can't pass the column values either.
Does anyone have a generic solution for these situations? I apologize if the answer is buried in the above link, but it's not clear to me if it is.
Note: A working gam function call looks like this:
mod = gam(Response ~ s(Predictor1, spar = 1.0), data = train)
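Where the train set is a data frame with column names "Response" & "Predictor1".
Use aes_string instead of aes when you pass column names as strings. Every aesthetic must then be given as a string, including the fixed y column. Note that aes_string is soft-deprecated in recent ggplot2 releases, where aes(x = .data[[predictor]]) is the suggested replacement.
plot <- ggplot(data = test, mapping = aes_string(x = predictor, y = "ViolentCrimesPerPop"))
For the gam function, here is an example adapted from the gam documentation. I have used a vector of column names; a single name is even easier. It just uses paste with a collapse argument to build the formula string.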
library(mgcv)
set.seed(2) ## simulate some data...
dat <- gamSim(1,n=400,dist="normal",scale=2)
# String manipulate for formula
formula <- as.formula(paste("y~s(", paste(colnames(dat)[2:5], collapse = ")+s("), ")", sep =""))
b <- gam(formula, data=dat)
is the same as
b <- gam(y~s(x0)+s(x1)+s(x2)+s(x3),data=dat)
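For the single-column case from the question, the same string-to-formula idea can be wrapped in a small helper. A minimal sketch, assuming mgcv is installed and reusing the simulated data from above. The helper name smooth_formula is hypothetical.

```r
library(mgcv)

# Hypothetical helper: build `response ~ s(predictor)` from two column names.
smooth_formula <- function(response, predictor) {
  reformulate(sprintf("s(%s)", predictor), response = response)
}

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

f <- smooth_formula("y", "x1")   # y ~ s(x1)
mod <- gam(f, data = dat)        # the column is chosen by the string "x1"
```

Inside a plotting function, the predictor argument would take the place of the literal "x1".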

how to loop through a list of variable names to use it with the update function from lm

I have a list of variable names that I would like to exclude, one at a time, from a best-fitting model using the update function on an lm object. Because the list of variables is likely to change, I want to loop through a given list, but I cannot get the elements of the list to be read as variables in the formula.
I found some code that I thought it could work:
Example code
hsb2 <-read.csv("www.ats.ucla.edu/stat/data/hsb2.csv")
names(hsb2)
varlist <- names(hsb2)[8:11]
models <- lapply(varlist, function(x) {
lm(substitute(read ~ i, list(i = as.name(x))), data = hsb2)
})
But it does not work if I use the update function on a previous lm object:
words<-c('Age','Sex', 'Residuals')
models <- lapply(words, function(x){update(substitute(
lmobject,~.-i,list(i = as.name(x))),data =data_complete)})
I also tried
re<-c()
for (i in 1:3) {
lmt<-update(lmobject,~.-words[i])
r2no_i<-summary(lmt)$r.squared
re<-c(re, r2no_i)
}
I think this is pretty simple, but I could not get the variable to be read properly.
Any tip is highly appreciated.
Is it possible that the built-in stats::drop1() function would do what you need?
Read data (note "http://..." is needed)
hsb2 <- read.csv("http://www.ats.ucla.edu/stat/data/hsb2.csv")
varlist <- names(hsb2)[8:11]
Fit all models: construct list of formulas:
formList <- lapply(varlist,reformulate,response="read")
Construct models (a little bit of fancy footwork is needed to get the $call component to look right)
modList <- lapply(formList,
function(x) {
m <- lm(x,data=hsb2)
m$call$formula <- eval(m$call$formula)
return(m)
})
words <- c('Age','Sex', 'Residuals')
pick one of the fitted models:
lmobject <- modList[[1]]
construct formulas of the form . ~ . - w
minus_form <- function(w)
as.formula(paste(". ~ . -", w))
minus_form("abc") ## . ~ . - abc
Refit models with dropped terms:
newMods <- lapply(words,
function(w) {
update(lmobject,minus_form(w))
})
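Since the data URL may not be reachable, here is a self-contained sketch of the same drop-one-term loop. It assumes the built-in mtcars data and an arbitrary three-term model, and it also collects the R-squared values the question asked for. The names fit, drop_terms and r2_without are hypothetical.

```r
# Starting model (an arbitrary choice for illustration).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
drop_terms <- c("wt", "hp", "qsec")

# Build `. ~ . - w` from a term name given as a string.
minus_form <- function(w) as.formula(paste(". ~ . -", w))

# Refit once per dropped term and record R-squared of each reduced model.
r2_without <- sapply(drop_terms, function(w) {
  reduced <- update(fit, minus_form(w))
  summary(reduced)$r.squared
})
r2_without
```

The result is a numeric vector named by the dropped term, so it stays readable when the list of terms changes.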

R. How to apply sapply() to random forest

I need to use a batch of models from the randomForest package. I decided to use a list, list.of.models, to store them. Now I don't know how to apply them. I append to the list using
list.of.models <- append(list.of.models, randomForest(data, as.factor(label)))
and then tried to use
sapply(list.of.models[length(list.of.models)], predict, data, type = "prob")
to call the last one, but the problem is that randomForest returns a list of many values, not a learner.
How do I add an RF model to a list and then call it? For example, let's take this source code:
data(iris)
set.seed(111)
ind <- sample(2, nrow(iris), replace = TRUE, prob=c(0.8, 0.2))
iris.rf <- randomForest(Species ~ ., data=iris[ind == 1,])
iris.pred <- predict(iris.rf, iris[ind == 2,])
With append you're extending your list.of.models with the elements inside the RF model, so they are not recognized by predict.randomForest etc., because the outer container list and its attribute class="randomForest" are lost. Also, the data input is named newdata in predict.randomForest.
This should work:
set.seed(1234)
library(randomForest)
data(iris)
test = sample(150,25)
#create 3 models
RF.models = lapply(1:3,function(mtry) {
randomForest(formula=Species~.,data=iris[-test,],mtry=mtry)
})
#append extra model
RF.models[[length(RF.models)+1]] = randomForest(formula=Species~.,data=iris[-test,],mtry=4)
summary(RF.models)
#predict all models
sapply(RF.models,predict,newdata=iris[test,])
#predict one model
predict(RF.models[[length(RF.models)]],newdata=iris[test,])
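The flattening behaviour of append is not specific to randomForest. A minimal sketch, using lm as a stand-in so that it runs without extra packages (an assumption for illustration only): wrapping the model in list() before appending keeps it as one element with its class intact.

```r
fit <- lm(mpg ~ wt, data = mtcars)

# Appending the model directly splices its components into the list.
flattened <- append(list(), fit)
length(flattened)          # many elements, none of them a model

# Wrapping in list() adds the model as a single element.
wrapped <- append(list(), list(fit))
length(wrapped)            # one element
class(wrapped[[1]])        # class is preserved

# predict now dispatches correctly on every stored model.
preds <- sapply(wrapped, predict, newdata = mtcars[1:3, ])
```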

Resources