Related
I have obtained a data set with slightly inconsistent and messy variable names. I would like to rename them in a an efficient and automated way.
I have a set of data frames and I need to rename some columns in several of them. The order of the columns, and length of the data frames differ, so I would like to use any function such as grep() or a subset term (df$x[== "term"]).
I have found an older question regarding this problem (Rename columns in multiple dataframes, R), but I haven't been able to get any of the suggested solutions to work since I get an error message. I do not have reputation enough to comment and ask further questions on those replies. However, my problem seems to be a bit different as I get an error message from my for loop that is not mentioned in the earlier question:
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
Setup: multiple data frames, let's call them myDF1, myDF2 ...
In those data frames there are columns with names (bad_name1, bad_name2) that should be changed to something else (good_name1, good_name2).
Replicable dataset:
myDF1 <- data.frame(bad_name1="A", bad_name2="B")
myDF2 <- data.frame(bad_name1="C", bad_name2="D")
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name0", replacement = "good_name1")
}
There are several ways of doing this. One that appealed to me is the subset method:
colnames(myDF1)[names(myDF1) == "bad_name1"] <- "good_name1")
This works fine as a single line, but not as a for loop.
for (x in c(myDF1,myDF2)) {
colnames(x)[colnames(x) == "bad_name1"] <- "good_name1"
}
Which renders the error message.
Error in `colnames<-`(`*tmp*`, value = character(0)) :
attempt to set 'colnames' on an object with less than two dimensions
The same error message applies with a 'gsub'-based method:
for (x in c(myDF1,myDF2)) {
colnames(x) <- gsub(x = colnames(x), pattern = "bad_name1", replacement = "good_name1")
}
I realise that I miss out on something fundamental here. I suppose that the for loop is not receiving the results of the 'colnames(x)' in an appropriate format. But I cannot understand how I'm supposed to make it work. The methods suggested in Rename columns in multiple dataframes, R does not really cover this error message.
Additional clarification, as asked for by vaettchen in a comment:
There is 3 column names that have to be changed (in all data frames). The reason is that they have names like varX.1, varX.2, varX.3 while I would prefer varXcount, varXmean, varXmax. So I have realised that there are names that I am not happy with, and decided on new ones based on my own taste.
You just need a few minor changes. Look at c(myDF1, myDF2) to see why that is not working - it splits the data frames into a list of 4 factors. Combine the data frames into a list and process the list:
all <- list(myDF1=myDF1, myDF2=myDF2)
for (x in seq_along(all)) {
colnames(all[[x]]) <- gsub(x = colnames(all[[x]]), pattern = "bad_name1",
replacement = "good_name1")
}
list2env(all, envir=.GlobalEnv)
I'm trying to write a function I can apply to a string vector or list instead of writing a loop. My goal is to run a regression for different endogenous variables and save the resulting tables. Since experienced R users tell us we should learn the apply functions, I want to give it a try. Here is my attempt:
Broken Example:
library(ExtremeBounds)
Data <- data.frame(var1=rbinom(30,1,0.2),var2=rbinom(30,1,0.2),var3=rnorm(30),var4=rnorm(30),var5=rnorm(30),var6=rnorm(30))
spec1 <- list(y=c("var1"),freevars=("var3"),doubtvars=c("var4","var5"))
spec2 <- list(y=c("var2"),freevars=("var4"),doubtvars=c("var3","var5","var6"))
specs <- c("spec1","spec2")
myfunction <- function(x){
eba <- eba(data=Data, y=x$y,
free=x$freevars,
doubtful=x$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, se.fun = se.robust, weights = "lri", family = binomial(logit))
output <- eba$bounds
output <- output[,-(3:7)]
}
lapply(specs,myfunction)
Which gives me an error that makes me guess that R does not understand when x should be "spec1" or "spec2". Also, I don't quite understand what lapply would try to collect here. Could you provide me with some best practice/hints how to communicate such things to R?
error: Error in x$y : $ operator is invalid for atomic vectors
Working example:
Here is a working example for spec1 without using apply that shows what I'm trying to do. I want to loop this example through 7 specs but I'm trying to get away from loops. The output does not have to be saved as a csv, a list of all outputs or any other collection would be great!
eba <- eba(data=Data, y=spec1$y,
free=spec1$freevars,
doubtful=spec1$doubtvars,
reg.fun=glm, k=1, vif=7, draws=50, se.fun = se.robust, weights = "lri", family = binomial(logit))
output <- eba$bounds
output <- output[,-(3:7)]
write.csv(output, "./Results/eba_pmr.csv")
Following the comments of #user20650, the solution is quite simple:
In the lapply command, use lapply(mget(specs),myfunction) which gets the names of the list elements of specs instead of the lists themselves.
Alternatively, one could define specs as a list: specs <- list(spec1,spec2) but that has the downside that the lapply command will return a list where the different specifications are numbered. The first version keeps the names of the specifications (spec1 and spec2) which which makes work with the resulting list much easier.
I have the following block of code. I am a complete beginner in R (a few days old) so I am not sure how much of the code will I need to share to counter my problem. So here is all of it I have written.
mdata <- read.csv("outcome-of-care-measures.csv",colClasses = "character")
allstate <- unique(mdata$State)
allstate <- allstate[order(allstate)]
spldata <- split(mdata,mdata$State)
if (num=="best") num <- 1
ranklist <- data.frame("hospital" = character(),"state" = character())
for (i in seq_len(length(allstate))) {
if (outcome=="heart attack"){
pdata <- spldata[[i]]
pdata[,11] <- as.numeric(pdata[,11])
bestof <- pdata[!is.na(as.numeric(pdata[,11])),][]
inorder <- order(bestof[,11],bestof[,2])
if (num=="worst") num <- nrow(bestof)
hospital <- bestof[inorder[num],2]
state <- allstate[i]
ranklist <- rbind(ranklist,c(hospital,state))
}
}
allstate is a character vector of states.
outcome can have values similar to "heart attack"
num will be numeric or "best" or "worst"
I want to create a data frame ranklist which will have hospital names and the state names which follow a certain criterion.
However I keep getting the error
invalid factor level, NA generated
I know it has something to do with rbind but I cannot figure out what is it. I have tried googling about this, and also tried troubleshooting using other similar queries on this site too. I have checked any of my vectors I am trying to bind are not factors. I also tried forcing the coercion by setting the hospital and state as.character() during assignment, but didn't work.
I would be grateful for any help.
Thanks in advance!
Since this is apparently from a Coursera assignment I am not going to give you a solution but I am going to hint at it: Have a look at the help pages for read.csv and data.frame. Both have the argument stringsAsFactors. What is the default, true or false? Do you want to keep the default setting? Is colClasses = "character" in line 1 necessary? Use the str function to check what the classes of the columns in mdata and ranklist are. read.csv additionally has an na.strings argument. If you use it correctly, also the NAs introduced by coercion warning will disappear and line 16 won't be necessary.
Finally, don't grow a matrix or data frame inside a loop if you know the final size beforehand. Initialize it with the correct dimensions (here 52 x 2) and assign e.g. the i-th hospital to the i-th row and first column of the data frame. That way rbind is not necessary.
By the way you did not get an error but a warning. R didn't interrupt the loop it just let you know that some values have been coerced to NA. You can also simplify the seq_len statement by using seq_along instead.
I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)*
}
*While calculating the mean I tried to subset first using the $ sign new_frame$variable and the subset function subset( new_frame, select = variable but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other subseting didn't work? It took me a really long time to figure it out and even though I managed to make it work I still don't know why it didn't work in the other ways and I really wanna look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the dataframe withe name variable.
On the other hand, using new_frame[, variable] uses the variablename in f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case if $ took variables or column names, which do you expect? The dd$var column or the dd$value column (because var == "value"). That's why the dd[, var] way is different because it only takes character vectors, not expressions referring to column names. You will get dd$value with dd[, var]
I'm not quite sure why you got None with subset() I was unable to replicate that problem.
I'm running various modeling algorithms on a data set. I've had best results by modeling my input variables to my responses one at a time, e.g.:
model <- train(y ~ x1 + x2 + ... + xn, ...)
Once I train my models, I'd like to not re-run them each time, so I've been trying to save them as .rda files. Here's an example loop for a random forest model (feel free to suggest a better way than a loop!):
# data_resp contains my measured responses, one per column
# data_pred contains my predictors, one per column
for (i in 1:ncol(data_resp)) {
model <- train(data_pred_scale[!is.na(data_resp[, i]), ],
data_resp[!is.na(data_resp[, i]), i],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
save(model, file = paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""))
When I load the model, however, it's going to be called model.
I haven't found a good way to save the model with it's corresponding name to try and refer back to it later. I found that one can assign an object to a string like so:
assign(paste("./models/model_rf_", names(data_resp)[i], ".rda", sep = ""), train(...))
But I'm still left with how to refer to the object when I save it:
save(???, file = ...)
I don't know how to call the object by it's custom name.
Lastly, even loading has presented a problem. I've tried assign("model_name", load("./model.rda")), but the resultant object, called string ends up just holding a string of the object name, "model".
In looking around, I found THIS question, which seems relevant, but I'm trying to figure out how to apply it to my situation.
I could create a list with the names of each column name in data_resp (my measured responses) and then use lapply to use train(), but I'm still a bit stuck on how to dynamically refer to the new object name to keep the resultant model in.
When you save the model, save another object called 'name' which is a character string of the thing you want to name it as:
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m1"
> save(model,name,file="save1.rda")
> d=data.frame(x=1:10,y=rnorm(10))
> model=lm(y~x,data=d)
> name="m2"
> save(model,name,file="save2.rda")
Now each file knows what it wants its resulting object to be called. How do you get that back on load? Load into a new environment, and assign:
> e=new.env()
> load("save1.rda",env=e)
> assign(e$name,e$model)
> summary(m1)
Call:
lm(formula = y ~ x, data = d)
You can now safely rm or re-use the 'e' object. You can of course wrap this in a function:
> blargh=function(f){e=new.env();load(f,env=e);assign(e$name,e$model,.GlobalEnv)}
> blargh("save2.rda")
> m2
Call:
lm(formula = y ~ x, data = d)
Note this is a double bad thing to do - firstly, you should probably store all the models in one file as a list with names. Secondly, this function has side effects, and if you had an object called m2 already it would get stomped on.
Using assign like this is nearly always a sign (dyswidt?) that you should use a list instead.
B
There is a fair amount of guesswork involved in this answer but I think this could help:
# get a vector with the column names in data_resp
modNames <- colnames( data_resp )
# create empty list
models <- as.list( NULL )
# iterate through your columns and assign the result as list members
for( n in modNames )
{
models[[n]] <- train(data_pred_scale[!is.na(data_resp[, n]), ], ### this may need modification, can't test without data
data_resp[!is.na(data_resp[, n]), n],
method = "rf",
tuneGrid = data.frame(.mtry = c(3:6)),
nodesize = 3,
ntrees = 500)
}
# save the whole bunch
save( models, file = "models.rda" )
You can now retrieve, just with load( "models.rda ), this one object, the list with all your models, and address them with list notation, either as models[[1]] or with the column name, eg. models[["first"]].
I think the other answers about doing this with a loop are great. I used this as a chance to finally try and understand lapply better, as many of the StackOverflow questions about how to do this ended up suggesting the use of lists and lapply instead of loops.
I really like the idea of combining all results of train() into a list (which #vaettchen did in his loop), and in thinking about how to do this with a list, this is what I came up with. First, I needed my data.frame in list form, one entry per column. Since I don't really work with lists, I hunted around until just trying as.list(df), which worked like a charm.
Next, I want to apply my train function to each element of my list of measured response variables, so I defined the function like this:
# predictors are stored in data_pred
# responses are in data_resp (one per column)
# rows in data_pred/data_resp (perhaps obviously) match, one per observation
train_func <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3:6),
ntrees = 500) }
Now I just need to use lapply to apply the train() call on each element of data_resp. I didn't know how to create an empty, placeholder list, so thanks to #vaettchen for that (I was trying list_name <- list() without success):
models <- lapply(as.list(data_resp), train_func)
Awesomely, I found that models has it's elements automatically named to my column names in data_resp, which is just fantastic. I'm using this in conjunction with the shiny package, so this will make it incredibly easy for the user to select a response variable from a drop down (which can store the response variable name) and do:
predict(models[["resp_name"]], new_data)
I think this is much better than the loop based approach and everything just happened to fall in place nicely. I realize the question explicitly asked for naming variables programmatically, so apologies if that pushed others to answer in that fashion vs. a "bigger picture" answer. The ease of lapply suggests I was trying to force a particular solution when a (at least to my eyes) much better one existed.
Bonus: I didn't realize lists could be multi-dimensional, but in trying it, it appears they can be! This is even better, as I'm using numerous algorithms and I can store everything in one big list object.
func_rf <- function(y) { train(x = data_pred, y = y,
method = "rf", tuneGrid = data.frame(.mtry = 3),
ntrees = 100) }
# svmRadial method requires formula syntax to work with factors,
# so the train function has to be a bit different
# add `scale = F` since I had to preProcess the numeric vars ahead of time
# and cbind to the factors. Without it, caret will try to scale the data
# for you, which fails for factors
func_svm <- function(y) { train(y ~ ., cbind(data_pred, y),
method = "svmRadial", tuneGrid = data.frame(.C = 1, .sigma = .2),
scale = F) }
model_list <- list(NULL)
model_list$rf <- lapply(as.list(data_resp), func_rf)
model_list$svm <- lapply(as.list(data_resp), func_svm)
Now I can refer the desired model and response variable with list syntax!
predict(model_list[["svm"]][["response_variable"]], new_data)
Super happy with this and hopefully it makes the code more efficient, faster, and I really love the "meta-object" I end up with vs. a ton of files, one per model/response variable combination, that I have to load in one at a time later on.
A bit of an old question but still without an accepted answer.
As I understand, you need to programmatically rename a variable and save it so that when reloaded it keeps the new name.
Try this:
saveWithName = function(var.name, obj){
# var.name is a string with the name of the variable you want to assign
# obj is any kind of R object (data.frame, list, etc.) you want to rename and save
assign(var.name, obj)
save(list=var.name, file=sprintf("model_%s.RData", var.name))
}
saveWithName("lab1", c(1,2))
saveWithName("lab2", c(3,4))
load("model_lab1.RData")
load("model_lab2.RData")
print(lab1)
#>[1] 1 2
print(lab2)
#[1] 3 4