Related
I have a dataset "res.sav" that I read in via haven. It contains 20 columns, called "Genes1_Acc4", "Genes2_Acc4" etc. I am trying to find a correlation coefficient between those and another column called "Condition". I want to separately list all coefficients.
I created two functions, cor.condition.cols and cor.func to do that. The first iterates through the filenames and works just fine. The second was supposed to give me my correlations which didn't work at all. I also created a new "cor.condition.Genes" which I would like to fill with the correlations, ideally as a matrix or dataframe.
I have tried to iterate through the columns with two functions. However, when I try to pass it, I get the error: "NAs introduced by conversion". This wouldn't be the end of the world (I tried also suppressWarning()). But the bigger problem I have that it seems like my function does not convert said columns into the numeric type I need for my cor() function. I receive the "y must be numeric" error when trying to run the cor() function. I tried to put several arguments within and without '' or "" without success.
When I ran str(cor.condition.cols) I only receive character strings, which makes me think that my function somehow messes up with the as.numeric function. Any suggestions of how else I could iter through these columns and transfer them?
Thanks guys :)
cor.condition.cols <- lapply(1:20, function(x){paste0("res$Genes", x, "_Acc4")})
#save acc_4 columns as numeric columns and calculate correlations
res <- (as.numeric("cor.condition.cols"))
cor.func <- function(x){
cor(res$Condition, x, use="complete.obs", method="pearson")
}
cor.condition.Genes <- cor.func(cor.condition.cols)
You can do:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
res2 <- as.numeric(as.matrix(res[cor.condition.cols]))
cor.condition.Genes <- cor(res2, res$Condition, use="complete.obs", method="pearson")
eventually the short variant:
cor.condition.cols <- paste0("Genes", 1:20, "_Acc4")
cor.condition.Genes <- cor(res[cor.condition.cols], res$Condition, use="complete.obs")
Here is an example with other data:
cor(iris[-(4:5)], iris[[4]])
I am trying to set up a function that will run lm() on a model derived from a user defined matrix in R.
The modelMatrix would be set up by the user, but would be expected to be structured as follows:
source target
1 "PVC" "AA"
2 "Aro" "AA"
3 "PVC" "Aro"
This matrix serves to allow the user to define the dependent(target) and independent(source) variables in the lm and refers to the column names of another valuesMatrix:
PVC Ar AA
[1,] -2.677875504 0.76141471 0.006114699
[2,] 0.330537781 -0.18462039 -0.265710261
[3,] 0.609826160 -0.62470233 0.715474554
I need to take this modelMatrix and generate a relevant number of lms. Eg in this case:
lm(AA ~ PVC + Aro)
and
lm(Aro ~ PVC)
I tried this, but it would seem that as the user changes the model matrix, it becomes erroneous and I need to explicitly specify each independent variable according to the modelMatrix.
```lm(as.formula(paste(unique(modelMatrix[,"target"])[1], "~ .",sep="")),
data=data.frame(valuesMatrix))
```
Do I need to set up 2 loops (1 nested) to fetch the source and target strings and paste them into the formula or am I overlooking some detail. Very confused.
Ideally I would like the user to be able to change the modelMatrix to include 1 or many lms and one or many independent variables for each lm. I would really appreciate your assistance as I am really hitting a wall here.
Thanks.
For your specific example this code should work -
source <- c("PVC","Aro","PVC")
target <- c("AA","AA","Aro")
modelMatrix <- data.frame(source = source, target = target)
valuesMatrix <- as.matrix(rbind(c(-2.677875504,0.76141471,0.006114699), c(0.330537781,-0.18462039,-0.265710261),
c(0.609826160,-0.62470233,0.715474554)))
colnames(valuesMatrix) <- c("PVC","Aro","AA")
unique.target <- as.character(unique(modelMatrix$target))
lm.models <- lapply(unique.target, function(x) {lm(as.formula(paste(x,"~ .", sep = "")),
data = data.frame(valuesMatrix[,colnames(valuesMatrix) %in%
c(x,as.character(modelMatrix$source[which(modelMatrix$target==x)]))]))})
You can use the ideas of the for loop but for loops can be to expensive especially if your modelMatrix get really large. It might not look as appealing but the lapply function is optimized for this type of job. The only other trick to this was to keep the columns required to perform the lm.
You can pull the results of each lm also but using:
lm[[1]] and lm[[2]]
I'm working with a set of results of INLA package in R. These results are stored in objects with meaningful names so I can have, for instance, model_a, model_b... in current environment. For each of these models I'd like to do several processing tasks including extracting of the data to separate data frame, which can then be used to merge to spatial data to create map, etc.
Turning to simpler, reproducible example let's assume two results
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
model_a <- lm(weight ~ group)
model_b <- lm(weight ~ group - 1)
I can handle the steps for an individual model, for instance:
model_a_sum <- data.frame(var = character(1), model_a_value = numeric(1))
model_a_sum$var <- "Intercept"
model_a_sum$model_a_value <- model_a$coefficients[1]
png("model_a_plot.png")
plot(model_a, las = 1)
dev.off()
Now, I'd like to reuse this code for each of the models, essentially constructing correct names depending on the model I'm using. I'm more Stata than R person and inside Stata that would be a trivial task to use the stub of a name (model_a, or even a only..) and construct foreach loop that would implement all the steps, adapting names for each of the models.
In R, for loops have been bashed all over the internet so I presume I shouldn't attempt to venture into the territory of:
models <- c("model_a", "model_b", "model_c")
for (model in models) {
...
}
What would be the better solution for such scenario?
Update 1: Since comments suggested that for might indeed be an option I'm trying to put all the tasks into a loop. So far I manged to name the data frame correctly using assign and get correct data plotted under correct name using get:
models <- c("model_a", "model_b")
for (i in 1:length(models)) {
# create df
name.df <- paste0(models[i], "_sum")
assign(name.df, data.frame(var = character(1), value = numeric(1)))
# replace variables of df with results from the model
# plot and save
name.plot <- paste0(models[i], "_plot.png")
png(name.plot)
plot(get(models[i]), which = 1, las = 1)
dev.off()
}
Is this reasonable approach? Any better solutions?
One thing I cannot solve is having the second variable of the df named according to the model (ie. model_a_value instead of current value. Any ideas how to solve that?
Some general tips/advice:
As mentioned in comments, don't believe much of the negativity about for loops in R. The issue is not that they are bad, but more that they are correlated with some bad code patterns that are inefficient.
More important is to use the right data organization. Don't keep the models each in a separate object!. Put them in a list:
l <- vector("list",3)
l[[1]] <- lm(...)
l[[2]] <- lm(...)
l[[3]] <- lm(...)
Then name the list:
names(l) <- paste0("model_",letters[1:3])
Now you can loop over the list without resorting to awkward and unnecessary tools like assign and get, and more importantly when you're ready to step up from for loops to tools like lapply you're all good to go.
I would use similar strategies for your data frames as well.
See #joran answer, this one is to show use of assign and get but should be avoided when possible.
I would go this way for the for loop:
for (model in models) {
m <- get(model) # to get the real model object
# create the model_?_sum dataframe
assign(paste0(model,"_sum"), data.frame(var = "Intercept", value = m$coefficients[1]))
assign(paste0(model,"_sum"), setNames( get(paste0(model,"_sum")), c("var",paste0(model,"_value"))) ) # per comment to rename the value column thanks to #Franck in chat for the guidance
# paste0 to create the text
png(paste0(model,"_plot.png"))
plot(m, las = 1) # use the m object to graph
dev.off()
}
which give the two images and this:
> model_a_sum
var value
(Intercept) Integer 5.032
> model_b_sum
var value
groupCtl Integer 5.032
>
I'm unsure of why you wish this dataframe, but I hope this give clues on how to makes variables names and how to access them.
Ran a bunch of regressions and now I am trying to collect their p values and put them into a vector.
x=summary(reg2)$coefficients[4,4] #p value from the first regression, p-val is in row 4, col 4
for (i in 3:1000){
currentreg=summary(paste("reg",i,sep=""))
assign(x,c(x,currentreg$coefficients[4,4]))
}
I also tried eval(parse(currentreg)) and eval(parse(summary(paste("reg",i,sep="")))) with no luck. I always have this problem with telling R "Hey don't treat this as a string, treat it as a variable" and vice versa.
While it would be better to store the objects in a list and loop over that, you're asking for get:
currentreg <- summary(get(paste("reg", i, sep="")))
If you had a list of objects, models <- list(reg2, reg3, reg4, ...). You can then loop over this list with sapply to achieve the desired result (looping, collecting the results into a vector):
x <- sapply(models, function(z) { summary(z)$coeficients[4,4] })
You can use
sapply(mget(ls(pattern = "^reg\\d+$")), function(x) summary(x)$coefficients[4,4])
to create a vector with all p-values.
I'm running various modeling algorithms on a data set. I've had best results by modeling my input variables to my responses one at a time, e.g.:
model <- train(y ~ x1 + x2 + ... + xn, ...)
Once I train my models, I'd like to not re-run them each time, so I've been trying to save them as .rda files. Here's an example loop for a random forest model (feel free to suggest a better way than a loop!):
# data_resp contains my measured responses, one per column
# data_pred contains my predictors, one per column
for (i in 1:ncol(data_resp)) {
model <- train(data_pred_scale[!is.na(data_resp[, i]), ],
data_resp[!is.na(data_resp[, i]), i],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
save(model, file = paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""))
When I load the model, however, it's going to be called model.
I haven't found a good way to save the model with it's corresponding name to try and refer back to it later. I found that one can assign an object to a string like so:
assign(paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""), train(...))
But I'm still left with how to refer to the object when I save it:
save(???, file = ...)
I don't know how to call the object by it's custom name.
Lastly, even loading has presented a problem. I've tried assign("model_name", load("./model.rda")), but the resultant object, called string ends up just holding a string of the object name, "model".
In looking around, I found THIS question, which seems relevant, but I'm trying to figure out how to apply it to my situation.
I could create a list with the names of each column name in data_resp (my measured responses) and then use lapply to use train(), but I'm still a bit stuck on how to dynamically refer to the new object name to keep the resultant model in.
When you save the model, save another object called 'name' which is a character string of the thing you want to name it as:
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m1"
> save(model,name,file="save1.rda")
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m2"
> save(model,name,file="save2.rda")
Now each file knows what it wants its resulting object to be called. How do you get that back on load? Load into a new environment, and assign:
> e=new.env()
> load("save1.rda",env=e)
> assign(e$name,e$model)
> summary(m1)
Call:
lm(formula = y ~ x, data = d)
You can now safely rm or re-use the 'e' object. You can of course wrap this in a function:
> blargh=function(f){e=new.env();load(f,env=e);assign(e$name,e$model,.GlobalEnv)}
> blargh("save2.rda")
> m2
Call:
lm(formula = y ~ x, data = d)
Note this is a double bad thing to do - firstly, you should probably store all the models in one file as a list with names. Secondly, this function has side effects, and if you had an object called m2 already it would get stomped on.
Using assign like this is nearly always a sign (dyswidt?) that you should use a list instead.
B
There is a fair amount of guesswork involved in this answer but I think this could help:
# get a vector with the column names in data_resp
modNames <- colnames( data_resp )
# create empty list
models <- as.list( NULL )
# iterate through your columns and assign the result as list members
for( n in modNames )
{
models[[n]] <- train(data_pred_scale[!is.na(data_resp[, n]), ], ### this may need modification, can't test without data
data_resp[!is.na(data_resp[, n]), n],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
}
# save the whole bunch
save( models, file = "models.rda" )
You can now retrieve, just with load( "models.rda ), this one object, the list with all your models, and address them with list notation, either as models[[1]] or with the column name, eg. models[["first"]].
I think the other answers about doing this with a loop are great. I used this as a chance to finally try and understand lapply better, as many of the StackOverflow questions about how to do this ended up suggesting the use of lists and lapply instead of loops.
I really like the idea of combining all results of train() into a list (which #vaettchen did in his loop), and in thinking about how to do this with a list, this is what I came up with. First, I needed my data.frame in list form, one entry per column. Since I don't really work with lists, I hunted around until just trying as.list(df), which worked like a charm.
Next, I want to apply my train function to each element of my list of measured response variables, so I defined the function like this:
# predictors are stored in data_pred
# responses are in data_resp (one per column)
# rows in data_pred/data_resp (perhaps obviously) match, one per observation
train_func <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3:6),
ntrees = 500) }
Now I just need to use lapply to apply the train() call on each element of data_resp. I didn't know how to create an empty, placeholder list, so thanks to #vaettchen for that (I was trying list_name <- list() without success):
models <- lapply(as.list(data_resp), train_func)
Awesomely, I found that models has it's elements automatically named to my column names in data_resp, which is just fantastic. I'm using this in conjunction with the shiny package, so this will make it incredibly easy for the user to select a response variable from a drop down (which can store the response variable name) and do:
predict(models[["resp_name"]], new_data)
I think this is much better than the loop based approach and everything just happened to fall in place nicely. I realize the question explicitly asked for naming variables programmatically, so apologies if that pushed others to answer in that fashion vs. a "bigger picture" answer. The ease of lapply suggests I was trying to force a particular solution when a (at least to my eyes) much better one existed.
Bonus: I didn't realize lists could be multi-dimensional, but in trying it, it appears they can be! This is even better, as I'm using numerous algorithms and I can store everything in one big list object.
func_rf <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3),
ntrees = 100) }
# svmRadial method requires formula syntax to work with factors,
# so the train function has to be a bit different
# add `scale = F` since I had to preProcess the numeric vars ahead of time
# and cbind to the factors. Without it, caret will try to scale the data
# for you, which fails for factors
func_svm <- function(y) { train(y ~ ., cbind(data_pred, y),
method = "svmRadial", tuneGrid = data.frame(.C = 1, .sigma = .2),
scale = F) }
model_list <- list(NULL)
model_list$rf <- lapply(as.list(data_resp), func_rf)
model_list$svm <- lapply(as.list(data_resp), func_svm)
Now I can refer the desired model and response variable with list syntax!
predict(model_list[["svm"]][["response_variable"]], new_data)
Super happy with this and hopefully it makes the code more efficient, faster, and I really love the "meta-object" I end up with vs. a ton of files, one per model/response variable combination, that I have to load in one at a time later on.
A bit of an old question but still without an accepted answer.
As I understand, you need to programmatically rename a variable and save it so that when reloaded it keeps the new name.
Try this:
saveWithName = function(var.name, obj){
# var.name is a string with the name of the variable you want to assign
# obj is any kind of R object (data.frame, list, etc.) you want to rename and save
assign(var.name, obj)
save(list=var.name, file=sprintf("model_%s.RData", var.name))
}
saveWithName("lab1", c(1,2))
saveWithName("lab2", c(3,4))
load("model_lab1.RData")
load("model_lab2.RData")
print(lab1)
#>[1] 1 2
print(lab2)
#[1] 3 4