I'm running various modeling algorithms on a data set. I've had best results by modeling my input variables to my responses one at a time, e.g.:
model <- train(y ~ x1 + x2 + ... + xn, ...)
Once I train my models, I'd like to not re-run them each time, so I've been trying to save them as .rda files. Here's an example loop for a random forest model (feel free to suggest a better way than a loop!):
# data_resp contains my measured responses, one per column
# data_pred contains my predictors, one per column
for (i in 1:ncol(data_resp)) {
  model <- train(data_pred_scale[!is.na(data_resp[, i]), ],
                 data_resp[!is.na(data_resp[, i]), i],
                 method = "rf",
                 tuneGrid = data.frame(.mtry = c(3:6)),
                 nodesize = 3,
                 ntrees = 500)
  save(model, file = paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""))
}
When I load the model, however, it's going to be called model.
I haven't found a good way to save the model with its corresponding name so that I can refer back to it later. I found that one can assign an object to a string like so:
assign(paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""), train(...))
But I'm still left with how to refer to the object when I save it:
save(???, file = ...)
I don't know how to refer to the object by its custom name.
Lastly, even loading has presented a problem. I've tried assign("model_name", load("./model.rda")), but the resulting object ends up just holding a character string with the name of the loaded object, "model".
In looking around, I found THIS question, which seems relevant, but I'm trying to figure out how to apply it to my situation.
I could create a list of the column names in data_resp (my measured responses) and then use lapply to call train(), but I'm still a bit stuck on how to dynamically refer to the new object name in which to keep the resultant model.
When you save the model, save another object called 'name' which is a character string of the thing you want to name it as:
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m1"
> save(model,name,file="save1.rda")
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m2"
> save(model,name,file="save2.rda")
Now each file knows what it wants its resulting object to be called. How do you get that back on load? Load into a new environment, and assign:
> e=new.env()
> load("save1.rda",env=e)
> assign(e$name,e$model)
> summary(m1)
Call:
lm(formula = y ~ x, data = d)
You can now safely rm or re-use the 'e' object. You can of course wrap this in a function:
> blargh=function(f){e=new.env();load(f,env=e);assign(e$name,e$model,.GlobalEnv)}
> blargh("save2.rda")
> m2
Call:
lm(formula = y ~ x, data = d)
Note this is a double bad thing to do - firstly, you should probably store all the models in one file as a list with names. Secondly, this function has side effects, and if you had an object called m2 already it would get stomped on.
Using assign like this is nearly always a sign (dyswidt?) that you should use a list instead.
B
There is a fair amount of guesswork involved in this answer but I think this could help:
# get a vector with the column names in data_resp
modNames <- colnames( data_resp )
# create empty list
models <- as.list( NULL )
# iterate through your columns and assign the result as list members
for( n in modNames )
{
models[[n]] <- train(data_pred_scale[!is.na(data_resp[, n]), ], ### this may need modification, can't test without data
data_resp[!is.na(data_resp[, n]), n],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
}
# save the whole bunch
save( models, file = "models.rda" )
You can now retrieve this one object, the list with all your models, just with load("models.rda"), and address the individual models with list notation, either as models[[1]] or by column name, e.g. models[["first"]].
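To make that concrete, a minimal sketch of the retrieval step (the response name "resp_1" and new_data are just placeholders here):
load("models.rda")                              # restores the single list object 'models'
predict(models[["resp_1"]], newdata = new_data) # pick a model by its response-column name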
I think the other answers about doing this with a loop are great. I used this as a chance to finally try and understand lapply better, as many of the StackOverflow questions about how to do this ended up suggesting the use of lists and lapply instead of loops.
I really like the idea of combining all results of train() into a list (which #vaettchen did in his loop), and in thinking about how to do this with a list, this is what I came up with. First, I needed my data.frame in list form, one entry per column. Since I don't really work with lists, I hunted around until just trying as.list(df), which worked like a charm.
Next, I want to apply my train function to each element of my list of measured response variables, so I defined the function like this:
# predictors are stored in data_pred
# responses are in data_resp (one per column)
# rows in data_pred/data_resp (perhaps obviously) match, one per observation
train_func <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3:6),
ntrees = 500) }
Now I just need to use lapply to apply the train() call on each element of data_resp. I didn't know how to create an empty, placeholder list, so thanks to #vaettchen for that (I was trying list_name <- list() without success):
models <- lapply(as.list(data_resp), train_func)
Awesomely, I found that models has its elements automatically named after my column names in data_resp, which is just fantastic. I'm using this in conjunction with the shiny package, so this will make it incredibly easy for the user to select a response variable from a drop-down (which can store the response variable name) and do:
predict(models[["resp_name"]], new_data)
I think this is much better than the loop based approach and everything just happened to fall in place nicely. I realize the question explicitly asked for naming variables programmatically, so apologies if that pushed others to answer in that fashion vs. a "bigger picture" answer. The ease of lapply suggests I was trying to force a particular solution when a (at least to my eyes) much better one existed.
Bonus: I didn't realize lists could be nested like this, but in trying it, it appears they can be! This is even better, as I'm using numerous algorithms and I can store everything in one big list object.
func_rf <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3),
ntrees = 100) }
# svmRadial method requires formula syntax to work with factors,
# so the train function has to be a bit different
# add `scale = F` since I had to preProcess the numeric vars ahead of time
# and cbind to the factors. Without it, caret will try to scale the data
# for you, which fails for factors
func_svm <- function(y) { train(y ~ ., cbind(data_pred, y),
method = "svmRadial", tuneGrid = data.frame(.C = 1, .sigma = .2),
scale = F) }
model_list <- list()
model_list$rf <- lapply(as.list(data_resp), func_rf)
model_list$svm <- lapply(as.list(data_resp), func_svm)
Now I can refer to the desired model and response variable with list syntax!
predict(model_list[["svm"]][["response_variable"]], new_data)
Super happy with this; hopefully it makes the code more efficient and faster, and I really love the "meta-object" I end up with, vs. a ton of files (one per model/response-variable combination) that I have to load in one at a time later on.
A bit of an old question but still without an accepted answer.
As I understand, you need to programmatically rename a variable and save it so that when reloaded it keeps the new name.
Try this:
saveWithName = function(var.name, obj){
# var.name is a string with the name of the variable you want to assign
# obj is any kind of R object (data.frame, list, etc.) you want to rename and save
assign(var.name, obj)
save(list=var.name, file=sprintf("model_%s.RData", var.name))
}
saveWithName("lab1", c(1,2))
saveWithName("lab2", c(3,4))
load("model_lab1.RData")
load("model_lab2.RData")
print(lab1)
#> [1] 1 2
print(lab2)
#> [1] 3 4
Related
I have a large dataframe with many factor-class features that I am attempting to one-hot encode (OHE). I'm attempting to use the dummyVars function from the caret package to do this. My issue is that since I have a large dataframe, I cannot OHE all of them at once. Here is the solution that I have come up with:
fac <- data.frame()
for (i in names(train.fact)) {
dmy <- dummyVars( ~ i , data = train.fact)
trsf <- data.frame(predict(dmy, newdata = train.fact))
fac <- cbind(fac, trsf)
}
My hope is that this for loop would OHE the first feature, store that in fac, then move on to the next feature, OHE it, cbind that information to fac, and so on.
When attempting to run this, I get this error:
Error in `[.data.frame`(data, , vars, drop = FALSE) : undefined columns selected
I believe this is due to the way the names of each feature are being passed into 'i'.
I also thought this may be done through an apply function but cannot come up with the appropriate syntax.
Any help is appreciated!
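One likely fix (a sketch, untested against your data, and assuming train.fact is your data frame of factor columns) is to build the formula from the column-name string instead of passing the literal symbol i:
library(caret)

fac <- data.frame()
for (i in names(train.fact)) {
  # as.formula() turns the string "~ colname" into a real formula, so dummyVars
  # sees the column's name rather than a variable literally called 'i'
  dmy  <- dummyVars(as.formula(paste0("~ ", i)), data = train.fact)
  trsf <- data.frame(predict(dmy, newdata = train.fact))
  fac  <- cbind(fac, trsf)
}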
I'm working with a set of results from the INLA package in R. These results are stored in objects with meaningful names, so I can have, for instance, model_a, model_b... in the current environment. For each of these models I'd like to do several processing tasks, including extracting the data to a separate data frame, which can then be merged with spatial data to create a map, etc.
Turning to a simpler, reproducible example, let's assume two results:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing the correct names depending on the model I'm using. I'm more of a Stata than an R person, and in Stata it would be a trivial task to use the stub of a name (model_a, or even just a) and construct a foreach loop that would implement all the steps, adapting the names for each of the models.
In R, for loops have been bashed all over the internet, so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be the better solution for such scenario?
Update 1: Since comments suggested that for might indeed be an option, I'm trying to put all the tasks into a loop. So far I've managed to name the data frame correctly using assign and to get the correct data plotted under the correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this a reasonable approach? Any better solutions?
One thing I cannot solve is having the second variable of the df named according to the model (i.e. model_a_value instead of the current value). Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep the models each in a separate object! Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
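For instance, here is a sketch of the question's per-model tasks run over such a named list (reusing the weight/group data from the question; file names are illustrative):
l <- list(model_a = lm(weight ~ group),
          model_b = lm(weight ~ group - 1))

sums <- lapply(names(l), function(nm) {
  m   <- l[[nm]]
  out <- data.frame(var = names(coef(m))[1], value = unname(coef(m)[1]))
  names(out)[2] <- paste0(nm, "_value")   # e.g. model_a_value
  png(paste0(nm, "_plot.png"))
  plot(m, which = 1, las = 1)
  dev.off()
  out
})
names(sums) <- names(l)
sums$model_a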
See #joran's answer; this one shows the use of assign and get, an approach that should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
assign(paste0(model,"_sum"), setNames( get(paste0(model,"_sum")), c("var",paste0(model,"_value"))) ) # per comment to rename the value column thanks to #Franck in chat for the guidance
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which gives the two images and this:
> model_a_sum
                  var model_a_value
(Intercept) Intercept         5.032
> model_b_sum
               var model_b_value
groupCtl Intercept         5.032
>
I'm not sure why you want this dataframe, but I hope this gives clues on how to make variable names and how to access them.
I am trying to create (vector) objects in R without specifying the names of the objects a priori. For example, if I have a list of length 3, I want to create the objects p1 to p3, and if I have a list of length 10, the objects p1 to p10 have to be created. The length should be arbitrary and not determined a priori.
Thanks for your help!
I guess the proper way of doing that is to use a list p = list(); then you can use p[[i]] with i as big as you wish, without having specified any length.
Then once your list is filled up, you can rename it: names(p) = paste0("p",c(1:length(p)))
Finally, if you want to get all the pi variables directly accessible, you add attach(p)
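A tiny end-to-end illustration of that approach (the vectors here are made up):
p <- list()
for (i in 1:3) p[[i]] <- rnorm(5)       # fill the list; no length declared up front
names(p) <- paste0("p", seq_along(p))   # p1, p2, p3
attach(p)                               # now p1 ... p3 are directly accessible
p2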
This is kind of a hack but you can do the following
short_list <- list(rnorm(10),rnorm(20),1:3)
long_list <- c(short_list,short_list )
paste0("p",seq_along(short_list))
mapply(assign, paste0("p",seq_along(short_list)), short_list, MoreArgs = list(envir = .GlobalEnv))
result:
> p3
[1] 1 2 3
you can do the same with long_list
I don't see a statistical model for which you would need this. Better to start working with lists like short_list, or with data.frames, directly.
PS: If you just want to use it for glm, you probably want to learn about formulas in R.
glm(y ~ ., data = your_data) takes all columns in your data frame that are not named y as regressors. Maybe this helps.
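A minimal sketch of that shorthand with made-up data:
your_data <- data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
fit <- glm(y ~ ., data = your_data)   # y regressed on every other column (x1 and x2)
coef(fit)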
assign (and maybe also attach) are often a sign that you have not yet arrived at an "R-ish" version of the code.
Considering that you need this for modeling: if your p1 to pn are of the same type, you can put them into a matrix inside a column of a data.frame (for modeling they need to be of the same length anyway):
df$matrix <- p.matrix
If you create the data.frame directly, you need to make sure the matrix is not expanded into individual data.frame columns:
df <- data.frame(matrix = I(p.matrix), ...)
Then glm(y ~ matrix, ...) will work.
For examples of this technique see e.g. packages pls or hyperSpec or the pls paper in the Journal of Statistical Software.
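A minimal runnable sketch of the technique, with made-up data:
p.matrix <- matrix(rnorm(50), nrow = 10)               # 10 observations, 5 "p" variables
df <- data.frame(y = rnorm(10), matrix = I(p.matrix))  # I() keeps the matrix as a single column
fit <- glm(y ~ matrix, data = df)
coef(fit)                                              # intercept plus one coefficient per matrix column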
I have a list of models and want to return an array (not a list) of their coefficients. (For the curious, I am running a single model on data from a bunch of different neurons. I would like an array that is coefficients x neurons.) The following works fine if all the models run successfully:
Coefs = sapply(ModelList, coef)
But if one of the models fails, then coef() returns 'NULL', which is a different length from the other return values, and I end up with a list instead of an array. :(
My solution works and is general-purpose, but it is horribly clumsy:
Coefs = sapply(ModelList, coef)
typical = Coefs[[1]] # (ought to ensure that this is not NULL!)
typical[1:length(typical)] = NA # Replace all coefficients with NA
Bad = sapply(ModelList, is.null) # Find the bad entries
for (i in which(Bad)) # For each 'NULL', (UGH! A LOOP!)
Coefs[[i]] = typical # replace with a proper entry (of NAs)
Coefs = simplify2array(Coefs) # Now I can convert it to an array
Is there a better solution?
Thanks!
larry
Still a little clumsy:
sapply(ModelList, function(x) if (is.null(coef(x))) NA else coef(x))
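If you want sapply to simplify straight to a matrix, one option, in the spirit of the NA-padding already in the question, is to replace the failed fits with a full-length vector of NAs (a sketch, assuming at least one model succeeded and all successful models share the same coefficients):
# take a template from the first model whose coef() is not NULL, then blank it out
ok       <- which(!sapply(ModelList, function(x) is.null(coef(x))))[1]
template <- coef(ModelList[[ok]])
template[] <- NA

Coefs <- sapply(ModelList, function(x) {
  cf <- coef(x)
  if (is.null(cf)) template else cf   # failed fits become a column of NAs
})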
One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
    forval n=1990/2000 {
        local m = `n'-1
        // create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this - rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case probably a list.
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do this is to keep your factors as factors instead of encoding them in variable names.
I'll make some data as I believe it looks in your R version now (at least, I hope so...):
Data <- data.frame(
popA1989 = 1:10,
popB1989 = 10:1,
popC1989 = 11:20,
popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe where you have a factor pop and a numeric variable year
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy:
for(i in 1:11){
tmp <- newData[newData$year==(1988+i),]
newData <- rbind(newData,
data.frame( values = tmp$values*(1+Trend[,i]),
pop = tmp$pop,
year = tmp$year+1
)
)
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
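For instance, a selection like "population A in 1995" is now a one-liner on newData:
subset(newData, pop == "A" & year == 1995)   # all rows for population A in 1995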
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe:
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
trend <- Trend[,i-1989] # get trend variable
Data <- within(Data,assign(new, old*(1+trend)))
}
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
for(i in 1990:2000) {
new <- paste("pop",L,i,sep="") # create name for new variable
old <- get(paste("pop",L,i-1,sep="")) # get old variable
trend <- get(paste("trend",i,sep="")) # get trend variable
assign(new, old*(1+trend))
}
}
Assuming you have population data in the vector pop1989 and data for the trend in trend:
require(stringr) # because str_c has a better default for the sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)
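A quick check of what this produces, with made-up numbers (a sketch; pop1989 holds one starting value per group A-D and trend one rate per year 1990-2000):
require(stringr)
pop1989 <- c(100, 200, 300, 400)                # starting populations for A, B, C, D
trend   <- runif(11, -0.1, 0.1)                 # one trend value per year 1990:2000
dta <- kronecker(pop1989, cumprod(1 + trend))
names(dta) <- kronecker(str_c("pop", LETTERS[1:4]), 1990:2000, str_c)
head(dta, 3)                                    # popA1990, popA1991, popA1992, ...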